R bloggers

R news and tutorials contributed by (573) R bloggers

functions exercises

Sun, 2016-02-07 16:35

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Today we’re practising functions! In the exercises below, you’re asked to write short R scripts that define functions aimed at specific tasks. The exercises start at an easy level, and gradually move towards slightly more complex functions.

Answers to the exercises are available here.

If you obtained a different solution than the one posted on the answers page, please let us know of your solution by posting it as a comment at the end of that page.

Exercise 1
Create a function that will return the sum of 2 integers.
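For illustration only (the official solutions are on the answers page linked above), a minimal sketch of one possible answer might look like this; the function name add_two is just a placeholder of my own:

# one possible solution sketch, not the official answer
add_two <- function(a, b) {
  a + b
}
add_two(3, 4) # returns 7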

Exercise 2
Create a function that will return TRUE if a given integer is inside a given vector.

Exercise 3
Create a function that, given a data frame, will print to the screen the name of each column and the class of data it contains (e.g. "Variable1 is numeric").

Exercise 4
Create your own version of the function unique: given a vector, it should return a new vector containing the elements of the input with duplicated elements removed.

Exercise 5
Create a function that given a vector and an integer will return how many times the integer appears inside the vector.

Exercise 6
Create a function that, given a vector, will print to the screen the mean and the standard deviation; optionally, it should also print the median.

Exercise 7
Create a function that, given an integer, will calculate how many divisors it has (other than 1 and itself), and print the divisors to the screen.
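Again purely as an illustration (the name and approach are my own, not the official answer), one possible shape for such a function is:

# hypothetical sketch: print the proper divisors of n and return their count
count_divisors <- function(n) {
  d <- setdiff(which(n %% seq_len(n) == 0), c(1, n))
  print(d)
  length(d)
}
count_divisors(12) # prints 2 3 4 6 and returns 4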

Exercise 8
Create a function that, given a data frame and a number or character value, will return the data frame with every occurrence of that value changed to NA.

Image by Ninjahatori (Own work) via Wikimedia Commons

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


ggseas package for seasonal adjustment on the fly with ggplot2

Sun, 2016-02-07 06:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

In a post a few months ago I built a new ggplot2 statistical transformation (“stat”) to provide X13-SEATS-ARIMA seasonal adjustment on the fly. With certain exploratory workflows this can be useful, saving on a step of seasonally adjusting many different slices and dices of a multi-dimensional time series during data reshaping, and just drawing it for you as part of the graphics definition process.

That code no longer works – one of the hazards of the fast changing data science environment – since the release of ggplot2 version 2.0 just before Christmas. ggplot2 v.2 introduced a whole new system of object oriented programming, ggproto, replacing the old approach that was based on proto. The good news is that the new approach is easy to use and well documented, and it’s been so successful in encouraging development based on ggplot that there’s now a whole website just to keep track of packages extending ggplot.

So it seemed time to wrap my little convenience function for seasonal adjustment into an R package, which duly has its home on GitHub. As with my similar posts on this topic, this work depends on Christoph Sax's {seasonal} R package, and of course the US Census Bureau's X13-SEATS-ARIMA software.

X13-SEATS-ARIMA

Here’s how it works, applying SEATS seasonal adjustment with the default outlier etc settings to the classic international Air Passengers example dataset. The original unadjusted data are in pale grey in the background, and the seasonally adjusted line is the dark one:

# installation if not already done
library(devtools)
install_github("ellisp/ggseas/pkg")

library(ggseas)
ggplot(ap_df, aes(x = x, y = y)) +
   geom_line(colour = "grey80") +
   stat_seas(start = c(1949, 1), frequency = 12) +
   ggtitle("SEATS seasonal adjustment - international airline passengers") +
   ylab("International airline passengers per month")

Extra arguments can get passed through to X13 via the x13_params argument, which takes a list of arguments that are passed through to the “list” argument to seas():

ggplot(ap_df, aes(x = x, y = y)) +
   geom_line(colour = "grey80") +
   stat_seas(start = c(1949, 1), frequency = 12,
             x13_params = list(x11 = "", outlier = NULL)) +
   ggtitle("X11 seasonal adjustment - international airline passengers") +
   ylab("International airline passengers per month")

The reason for doing this with ggplot2 of course is that we can have not just univariate series, but data split by dimensions mapped to facets, colour or other aesthetics. This works very naturally, with stat_seas working just like other stats and geoms in the ggplot2 universe:

p3 <- ggplot(ldeaths_df, aes(x = YearMon, y = deaths, colour = sex)) +
   geom_point(colour = "grey50") +
   geom_line(colour = "grey50") +
   facet_wrap(~sex) +
   stat_seas(start = c(1974, 1), frequency = 12, size = 2) +
   ggtitle("Seasonally adjusted lung deaths in the UK 1974 - 1979") +
   ylab("Deaths") +
   xlab("(light grey shows original data;\ncoloured line is seasonally adjusted)") +
   theme(legend.position = "none")

print(p3) # note - the call to seas() isn't run until each and every time you *print* the plot

More basic seasonal decomposition

I included two alternative ways of doing seasonal adjustment, more for completeness than anything else. Here's one demo with Loess-based seasonal decomposition, using the stl() function from the {stats} package:

ggplot(ap_df, aes(x = x, y = y)) + stat_stl(frequency = 12, s.window = 7)

This sort of seasonal decomposition is quicker to fit than X13-SEATS-ARIMA but not good enough for (for example) publishing official statistics. However, it might be useful in an exploratory workflow when you don't want the computational overhead of fitting ARIMA models. This package aims to help with exactly that sort of quick analysis.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.


Build your own neural network classifier in R

Sat, 2016-02-06 18:30

(This article was first published on Jun Ma - Data Blog, and kindly contributed to R-bloggers)

Introduction

Image classification is one important field in Computer Vision, not only because so many applications are associated with it, but also because a lot of Computer Vision problems can be effectively reduced to image classification. The state-of-the-art tool in image classification is the Convolutional Neural Network (CNN). In this article, I am going to write a simple Neural Network with 2 layers (fully connected). First, I will train it to classify a set of 4-class 2D data and visualize the decision boundary. Second, I am going to train my NN with the famous MNIST data (https://www.kaggle.com/c/digit-recognizer) and see its performance. The first part is inspired by the CS 231n course offered by Stanford: http://cs231n.github.io/, which is taught in Python.

Data set generation

First, let's create a spiral dataset with 4 classes and 200 examples each.

library(ggplot2)
library(caret)

N <- 200 # number of points per class
D <- 2   # dimensionality
K <- 4   # number of classes
X <- data.frame() # data matrix (each row = single example)
y <- data.frame() # class labels

set.seed(308)

for (j in (1:K)){
  r <- seq(0.05, 1, length.out = N) # radius
  t <- seq((j-1)*4.7, j*4.7, length.out = N) + rnorm(N, sd = 0.3) # theta
  Xtemp <- data.frame(x = r*sin(t), y = r*cos(t))
  ytemp <- data.frame(matrix(j, N, 1))
  X <- rbind(X, Xtemp)
  y <- rbind(y, ytemp)
}

data <- cbind(X, y)
colnames(data) <- c(colnames(X), 'label')

X and y are 800 by 2 and 800 by 1 data frames respectively, and they are created in a way such that a linear classifier cannot separate them. Since the data is 2D, we can easily visualize it on a plot. The classes are roughly evenly spaced and indeed a line is not a good decision boundary.

x_min <- min(X[,1]) - 0.2; x_max <- max(X[,1]) + 0.2
y_min <- min(X[,2]) - 0.2; y_max <- max(X[,2]) + 0.2

# lets visualize the data:
ggplot(data) + geom_point(aes(x=x, y=y, color = as.character(label)), size = 2) +
  theme_bw(base_size = 15) +
  xlim(x_min, x_max) + ylim(y_min, y_max) +
  ggtitle('Spiral Data Visualization') +
  coord_fixed(ratio = 0.8) +
  theme(axis.ticks=element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.text=element_blank(),
        axis.title=element_blank(), legend.position = 'none')

Neural network construction

Now, let's construct a NN with 2 layers. But before that, we need to convert X into a matrix (for matrix operations later on). For the labels in y, a new matrix Y (800 by 4) is created such that for each example (each row in Y), the entry with index==label is 1 (and 0 otherwise).

X <- as.matrix(X)
Y <- matrix(0, N*K, K)

for (i in 1:(N*K)){
  Y[i, y[i,]] <- 1
}

Next, let's build a function 'nnet' that takes two matrices X and Y and returns a list of 4 elements: W, b and W2, b2 (the weight and bias for each layer). I can specify step_size (learning rate) and regularization strength (reg, sometimes symbolized as lambda).

For the choice of activation and loss (cost) function, ReLU and softmax are selected respectively. If you have taken the ML class by Andrew Ng (strongly recommended), sigmoid and the logistic cost function are chosen in the course notes and assignments. They look slightly different, but can be implemented fairly easily just by modifying the following code. Also note that the implementation below uses vectorized operations that may seem hard to follow. If so, you can write down the dimensions of each matrix and check the multiplications and so on. By doing so, you also get to know what's under the hood of a neural network.

# %*% dot product, * element wise product
nnet <- function(X, Y, step_size = 0.5, reg = 0.001, h = 10, niteration){
  # get dim of input
  N <- nrow(X) # number of examples
  K <- ncol(Y) # number of classes
  D <- ncol(X) # dimensionality

  # initialize parameters randomly
  W <- 0.01 * matrix(rnorm(D*h), nrow = D)
  b <- matrix(0, nrow = 1, ncol = h)
  W2 <- 0.01 * matrix(rnorm(h*K), nrow = h)
  b2 <- matrix(0, nrow = 1, ncol = K)

  # gradient descent loop to update weight and bias
  for (i in 0:niteration){
    # hidden layer, ReLU activation
    hidden_layer <- pmax(0, X %*% W + matrix(rep(b, N), nrow = N, byrow = T))
    hidden_layer <- matrix(hidden_layer, nrow = N)
    # class scores
    scores <- hidden_layer %*% W2 + matrix(rep(b2, N), nrow = N, byrow = T)

    # compute and normalize class probabilities
    exp_scores <- exp(scores)
    probs <- exp_scores / rowSums(exp_scores)

    # compute the loss: softmax and regularization
    corect_logprobs <- -log(probs)
    data_loss <- sum(corect_logprobs*Y)/N
    reg_loss <- 0.5*reg*sum(W*W) + 0.5*reg*sum(W2*W2)
    loss <- data_loss + reg_loss
    # check progress
    if (i%%1000 == 0 | i == niteration){
      print(paste("iteration", i, ': loss', loss))
    }

    # compute the gradient on scores
    dscores <- probs - Y
    dscores <- dscores/N

    # backpropagate the gradient to the parameters
    dW2 <- t(hidden_layer) %*% dscores
    db2 <- colSums(dscores)
    # next backprop into hidden layer
    dhidden <- dscores %*% t(W2)
    # backprop the ReLU non-linearity
    dhidden[hidden_layer <= 0] <- 0
    # finally into W, b
    dW <- t(X) %*% dhidden
    db <- colSums(dhidden)

    # add regularization gradient contribution
    dW2 <- dW2 + reg*W2
    dW <- dW + reg*W

    # update parameters
    W <- W - step_size*dW
    b <- b - step_size*db
    W2 <- W2 - step_size*dW2
    b2 <- b2 - step_size*db2
  }
  return(list(W, b, W2, b2))
}

Prediction function and model training

Next, create a prediction function, which takes X (same columns as the training X, but possibly a different number of rows) and the layer parameters as input. The output is the column index of the maximum score in each row; in this example, that is simply the label of each class. Then we can print out the training accuracy.
nnetPred <- function(X, para = list()){
  W <- para[[1]]
  b <- para[[2]]
  W2 <- para[[3]]
  b2 <- para[[4]]

  N <- nrow(X)
  hidden_layer <- pmax(0, X %*% W + matrix(rep(b, N), nrow = N, byrow = T))
  hidden_layer <- matrix(hidden_layer, nrow = N)
  scores <- hidden_layer %*% W2 + matrix(rep(b2, N), nrow = N, byrow = T)
  predicted_class <- apply(scores, 1, which.max)
  return(predicted_class)
}

nnet.model <- nnet(X, Y, step_size = 0.4, reg = 0.0002, h = 50, niteration = 6000)

## [1] "iteration 0 : loss 1.38628868932674"
## [1] "iteration 1000 : loss 0.967921639616882"
## [1] "iteration 2000 : loss 0.448881467342854"
## [1] "iteration 3000 : loss 0.293036646147359"
## [1] "iteration 4000 : loss 0.244380009480792"
## [1] "iteration 5000 : loss 0.225211501612035"
## [1] "iteration 6000 : loss 0.218468573259166"

predicted_class <- nnetPred(X, nnet.model)
print(paste('training accuracy:', mean(predicted_class == (y))))

## [1] "training accuracy: 0.96375"

Decision boundary

Next, let's plot the decision boundary. We can also use the caret package to train different classifiers on the same data and visualize their decision boundaries. It is very interesting to see how different algorithms make decisions, but that is going to be another post.
# plot the resulting classifier
hs <- 0.01
grid <- as.matrix(expand.grid(seq(x_min, x_max, by = hs), seq(y_min, y_max, by = hs)))
Z <- nnetPred(grid, nnet.model)

ggplot() +
  geom_tile(aes(x = grid[,1], y = grid[,2], fill = as.character(Z)), alpha = 0.3, show.legend = F) +
  geom_point(data = data, aes(x=x, y=y, color = as.character(label)), size = 2) +
  theme_bw(base_size = 15) +
  ggtitle('Neural Network Decision Boundary') +
  coord_fixed(ratio = 0.8) +
  theme(axis.ticks=element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.text=element_blank(),
        axis.title=element_blank(), legend.position = 'none')

MNIST data and preprocessing

The famous MNIST ("Modified National Institute of Standards and Technology") dataset is a classic within the Machine Learning community that has been extensively studied. It is a collection of handwritten digits stored in a csv file, with each row representing one example and the column values being the grey-scale intensity (0-255) of each pixel. First, let's display an image.

displayDigit <- function(X){
  m <- matrix(unlist(X), nrow = 28, byrow = T)
  m <- t(apply(m, 2, rev))
  image(m, col = grey.colors(255))
}

train <- read.csv("data/train.csv", header = TRUE, stringsAsFactors = F)
displayDigit(train[18, -1])

Now, let's preprocess the data by removing near-zero-variance columns and scaling by max(X). The data is also split into two parts for cross validation. Once again, we need to create a Y matrix with dimension N by K. This time the non-zero index in each row is offset by 1: label 0 will have entry 1 at index 1, label 1 will have entry 1 at index 2, and so on. In the end, we need to convert it back. (Another way is to put label 0 at index 10 and use no offset for the rest of the labels.)

nzv <- nearZeroVar(train)
nzv.nolabel <- nzv - 1

inTrain <- createDataPartition(y=train$label, p=0.7, list=F)
training <- train[inTrain, ]
CV <- train[-inTrain, ]

X <- as.matrix(training[, -1]) # data matrix (each row = single example)
N <- nrow(X) # number of examples
y <- training[, 1] # class labels

K <- length(unique(y)) # number of classes
X.proc <- X[, -nzv.nolabel]/max(X) # scale
D <- ncol(X.proc) # dimensionality

Xcv <- as.matrix(CV[, -1]) # data matrix (each row = single example)
ycv <- CV[, 1] # class labels
Xcv.proc <- Xcv[, -nzv.nolabel]/max(X) # scale CV data

Y <- matrix(0, N, K)
for (i in 1:N){
  Y[i, y[i]+1] <- 1
}

Model training and CV accuracy

Now we can train the model with the training set. Note that even after removing the nzv columns, the data is still huge, so it may take a while for the result to converge. Here I am only training the model for 3500 iterations.
You can vary the iterations, learning rate and regularization strength, and plot the learning curve to find the optimal fit.

nnet.mnist <- nnet(X.proc, Y, step_size = 0.3, reg = 0.0001, niteration = 3500)

## [1] "iteration 0 : loss 2.30265553844748"
## [1] "iteration 1000 : loss 0.303718250939774"
## [1] "iteration 2000 : loss 0.271780096710725"
## [1] "iteration 3000 : loss 0.252415244824614"
## [1] "iteration 3500 : loss 0.250350279456443"

predicted_class <- nnetPred(X.proc, nnet.mnist)
print(paste('training set accuracy:', mean(predicted_class == (y+1))))

## [1] "training set accuracy: 0.93089140563888"

predicted_class <- nnetPred(Xcv.proc, nnet.mnist)
print(paste('CV accuracy:', mean(predicted_class == (ycv+1))))

## [1] "CV accuracy: 0.912360085734699"

Prediction of a random image

Finally, let's randomly select an image and predict the label.

Xtest <- Xcv[sample(1:nrow(Xcv), 1), ]
Xtest.proc <- as.matrix(Xtest[-nzv.nolabel], nrow = 1)
predicted_test <- nnetPred(t(Xtest.proc), nnet.mnist)
print(paste('The predicted digit is:', predicted_test - 1))

## [1] "The predicted digit is: 3"

displayDigit(Xtest)

Conclusion

It is rare nowadays for us to write our own machine learning algorithm from the ground up. There are tons of packages available, and most of them likely outperform this one. However, by doing so, I really gained a deep understanding of how a neural network works. And at the end of the day, seeing your own model produce a pretty good accuracy is a huge satisfaction.

To leave a comment for the author, please follow the link and comment on their blog: Jun Ma - Data Blog.


There’s a party at Alexa’s place

Sat, 2016-02-06 14:34

(This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers)

This post was originally posted on There’s a party at Alexa’s place.

This is my first blog of the year…so I want it to be something really nice and huge -:) You know how much I love the R Programming Language…but I also love other technologies as well…so taking a bunch of them and hooking them up together is what really brings me joy. Now…you may be wondering about the blog title…"There's a party at Alexa's place"…well…I wanted it to describe the blog in a funny way…so let's see what we're going to build -;) Got any idea?

Basically…we're going to use Amazon Alexa as our UI…when we ask a command…we're going to call a NodeJS Server on Heroku (which BTW has a PhantomJS client installed)…this NodeJS will call an R Server on Heroku (using the Rook Server)…and this R Server is going to call HANA Cloud Platform to get some Flights information and generate nice graphics that are going to be returned to the NodeJS Server, which is going to call our web browser to display the graphic generated by the R Server…of course…by using PhantomJS we're going to read the generated web page on the browser and this will be sent back to Amazon Alexa so she can read out the response…interesting enough for you? I hope -:) It took me more than two weeks to get all this up and running…so you better like it -:P So…let's go in some simple steps…

GET A HANA CLOUD PLATFORM ACCOUNT

You should have one already…if not…just go here to create one… Then…we need to download the HANA Cloud Platform SDK, extract it and modify the file tools/neo.sh on line 57…

Instead of this…

javaExe="$JAVA_HOME/bin/$javaCommand"

Use this…

javaExe="$JAVA_HOME"

Why? Well…it will make sense later on…or maybe it will make sense now… You should also have the SAP HANA Client installed…otherwise download it from here…take note that we will need to copy the ngdbc.jar file…

GETTING THE DATA THAT WE'RE GOING TO USE

As always…in almost all my blogs…we're going to use tables from the Flight model…which of course…doesn't exist on HANA Cloud Platform… The easiest way (at least for me) was to access an R/3 server…and simply download the tables as XLS files…convert them into CSV files and upload them into HCP… And BTW…for some weird reason my R/3 didn't have American Airlines listed in the SCARR table…so I just added it -;) Now…if you don't have access to an R/3 system…then you can download the tables in CSV format from here

CREATE THE R SERVER ON HEROKU

If you don't have the Heroku Tool Belt installed…then go and grab it…

Steps to install R on Heroku with Graphic Capabilities

mkdir myproject && cd myproject
mkdir bin
echo "puts 'OK'" > config.ru
echo "source 'http://rubygems.org'\n gem 'rack'" > Gemfile
#Open your project folder and modify the Gemfile to replace the "\n" with an actual line break…
bundle install
git init . && git add . && git commit -m "Init"
heroku apps:create myproject --stack=cedar
git push heroku master
#Copy and paste the content of my installR.sh into the /bin folder of your project
git add . && git commit -am "message" && git push heroku master
heroku ps:scale web=0

installR.sh

#!/bin/bash

function download() {
if [ ! -f "$2" ]; then
echo Downloading $2...
curl $1 -o $2
else
echo Got $2...
fi
}

set -e

r_version="${1:-3.2.3}"
r_version_major=${r_version:0:1}

if [ -z "$r_version" ]; then
echo "USAGE: $0 VERSION"
exit 1
fi

basedir="$( cd -P "$( dirname "$0" )" && pwd )"

# create output directory
vendordir=/app/vendor
mkdir -p $vendordir

# R
download http://cran.r-project.org/src/base/R-$r_version_major/R-$r_version.tar.gz R-$r_version.tar.gz
tar xzf R-$r_version.tar.gz

# build R
echo ============================================================
echo Building R
echo ============================================================
cd $basedir/R-$r_version/

./configure --prefix=$vendordir/R --with-blas --with-lapack --enable-R-shlib --with-readline=no --with-x=yes
make

cd /app/bin
ln -s R-$r_version/bin/R

rm R-3.2.3.tar.gz
rm -rf erb gem irb rake rdoc ri ruby testrb
rm ruby.exe

cd /app/bin/R-$r_version

rm -rf src
rm Make*
rm -rf doc
rm -rf tests
rm README ChangeLog COPYING INSTALL SVN-REVISION VERSION

Now…we need to do a very important step -:) We need to install the totally awesome heroku-buildpack-multi from ddollar.

heroku buildpacks:set https://github.com/ddollar/heroku-buildpack-multi.git

After that…we need to create a couple of files…one called .buildpacks and the other Aptfile.

.buildpacks

https://github.com/ddollar/heroku-buildpack-apt
https://github.com/heroku/heroku-buildpack-ruby

Aptfile

gfortran
libatlas-base-dev
libatlas-dev
liblapack-dev

Now comes the interesting part…installing R -;)

Finally…installing R…

git add . && git commit -am "message" && git push heroku master
heroku ps:scale web=0
heroku run bash
cd bin/
./installR.sh

With this done…we will have all the missing libraries needed to compile R on the new Cedar Stack on Heroku and also…we will have a nicely installed R instance with Graphic capabilities…but of course…we’re not done yet…

Installing the R Libraries

#This will open R on Heroku…
R
#This will install the libraries with their corresponding dependencies
install.packages("Rook",dependencies=TRUE)
install.packages("Cairo",dependencies=TRUE)
install.packages("maps",dependencies=TRUE)
install.packages("forecast",dependencies=TRUE)
install.packages("plotrix",dependencies=TRUE)
install.packages("ggplot2",dependencies=TRUE)
install.packages("ggmap",dependencies=TRUE)
install.packages("rJava",dependencies=TRUE)
install.packages("RJDBC",dependencies=TRUE)
q()

All right…we're almost there -;) The problem with Heroku is that it is not writable…meaning that once you get disconnected…you will lose all your work -:(

So…we need to back it up and send it somewhere else…I used my R Server on Amazon Web Services for this…

First…we need to compress the bin folder like this…

tar -cvzf bin.tar.gz bin

and then we need to save this file in our external server…

scp -i XXX.pem bin.tar.gz ec2-user@X.X.X.X:~/Blag/bin.tar.gz

and of course after that we need it in our project folder…so we need to send it from our external server to our project folder, where we will simply need to uncompress it…

scp -i XXX.pem ec2-user@X.X.X.X:~/Blag/bin.tar.gz bin.tar.gz

Once this is done…we need to do something real quick…

If you’re still inside the Heroku bash…simply run

which java

and take note of the location…

Now…we can close Heroku and keep going…copy the bin.tar.gz file into your project folder and extract it…then…create the following files…

config.ru

`/app/bin/R-3.2.3/bin/R -e "source('/app/demo.R')"`

demo.R

library("Rook")
library("ggmap")
library("maptools")
library("maps")
library("Cairo")
library("RJDBC")

myPort <- as.numeric(Sys.getenv("PORT"))
#myPort <- 8000
myInterface <- "0.0.0.0"
status <- .Call(tools:::startHTTPD, myInterface, myPort)

newapp<-function(env){
req<-Rook::Request$new(env)
res<-Rook::Response$new()

pass<-system('./hcp.sh',intern=T)
jdbcDriver <- JDBC("com.sap.db.jdbc.Driver","ngdbc.jar", identifier.quote="`")
conn_server <- dbConnect(jdbcDriver, "jdbc:sap://localhost:30015", "DEV_6O3Q8EAM96G64Q8M0P5KLA1A3",pass[1])
airports_param<- dbGetQuery(conn_server, "SELECT ID FROM NEO_BLR8NOQVY6TG8KXKQJ003WVOJ.SAIRPORT")
sairport<-data.frame(IATA=airports_param$ID)
airports<-read.csv("airports.dat", header = FALSE)
colnames(airports)<-c("ID", "name", "city", "country", "IATA", "ICAO", "lat", "lon")
airports<-subset(airports, IATA %in% sairport$IATA)
keeps <- c("city","country","IATA","lat","lon")
airports<-airports[keeps]

if(req$params()$airports != '')
{
count_sairport<-nrow(airports)
mp <- NULL
mapWorld <- borders("world", colour="gray50")
mp <- ggplot() + mapWorld
CairoJPEG(filename = "R_Plot.jpg", width = 680, height = 680)
mp <- mp+ geom_point(aes(x=airports$lon, y=airports$lat) ,color="red", size=3)
show(mp)
dev.off()
}else if(req$params()$us_airports != '')
{
US_Airports<-airports[airports$country == "United States", ]
count_sairport<-nrow(US_Airports)
mp <- NULL
mapWorld <- borders("state", colour="gray50")
mp <- ggplot() + mapWorld
mp <- mp+geom_point(aes(x=US_Airports$lon, y=US_Airports$lat) ,color="red", size=3)+
geom_text(data=US_Airports, aes(US_Airports$lon, US_Airports$lat, label = US_Airports$city), size=6)
CairoJPEG(filename = "R_Plot.jpg", width = 680, height = 680)
show(mp)
dev.off()
}else if(req$params()$carriers != '')
{
US_Airports<-airports[airports$country == "United States", ]
US_Airports<-as.vector(t(US_Airports))
US_Airports<-paste(shQuote(US_Airports), collapse=", ")
#query<-paste("SELECT CARRID, DISTANCE FROM NEO_BLR8NOQVY6TG8KXKQJ003WVOJ.SPFLI WHERE AIRPFROM IN (",US_Airports,")")
query<-paste("SELECT SPFLI.CARRID,CARRNAME,DISTANCE FROM NEO_BLR8NOQVY6TG8KXKQJ003WVOJ.SPFLI
INNER JOIN NEO_BLR8NOQVY6TG8KXKQJ003WVOJ.SCARR ON SPFLI.CARRID = SCARR.CARRID
WHERE AIRPFROM IN (",US_Airports,")")
result<-dbGetQuery(conn_server,query)
result$DISTANCE<-sub(",\\d+","",result$DISTANCE)
result$DISTANCE<-sub("\\.","",result$DISTANCE)
carriers<-data.frame(CARRID=result$CARRID,CARRNAME=result$CARRNAME,
DISTANCE=as.numeric(result$DISTANCE),stringsAsFactors = FALSE)
carriers_sum<-aggregate(DISTANCE~CARRID + CARRNAME,data=carriers,FUN=sum)
cols<-c('CARRNAME','DISTANCE')
data <- apply( carriers_sum[ , cols ] , 1 , paste , collapse = " " )
count_sairport<-paste(shQuote(data), collapse=", ")
mp <- NULL
CairoJPEG(filename = "R_Plot.jpg", width = 680, height = 680)
mp<-ggplot(carriers_sum,aes(x=CARRID,y=DISTANCE,fill=CARRID))+
geom_bar(position="dodge",stat="identity")
show(mp)
dev.off()
}

size <- as.integer(system("stat --format %s R_Plot.jpg", intern=T))
to.read = file("R_Plot.jpg", "rb")
#x<-readBin(to.read, raw(),n=18674)
x<-readBin(to.read, raw(),n=size)
hex<-paste(x, collapse = "")
hex<-paste(hex,count_sairport,sep = "/")
close(to.read)

res$write(hex)
res$finish()
}

unlockBinding("httpdPort", environment(tools:::startDynamicHelp))
assign("httpdPort", myPort, environment(tools:::startDynamicHelp))

server = Rhttpd$new()
server$listenAddr <- myInterface
server$listenPort <- myPort
server$add(app = newapp, name = "summarize")

while(T) {
Sys.sleep(10000)
}

So…let's take some time to understand what's going on with this code…we're going to create a Rook server…which will allow us to host web pages from R…then, we're going to use our hcp.sh script to get the password for our HANA Cloud Platform bridge…so we can get a JDBC connection to the database…from there we want to get a list of all the airports and also read the airports from a file detailed later (this airports file contains the geolocation of the airports). With this…we want to filter the airports from HANA against the airports in the file…so we don't have any extra data…now…we have three choices…airports, US airports or carriers…the first one will generate a map of the world with all the airports as little red dots…the second one will generate a map of the US with the airports as little red dots but also showing the names of the cities…the last one will generate a bar chart with the flight distances summed up by carrier…later on…we're going to read the information of the generated graphic to create a hexadecimal string of the graphic along with some information that Alexa should spell out…easy as cake, huh?

Procfile

web: bundle exec rackup config.ru

We want this R Server to be able to access HANA Cloud Platform…so let’s do that before we keep going…

With the location of Java…apply this command…

heroku config:set JAVA_HOME='/usr/bin/java'

Now…Copy the following files into your project folder…

“tools” folder from the HANA Cloud Platform SDK
ngdbc.jar from SAP HANA Client

Also…create this little script which is going to allow us to connect to HCP…

hcp.sh

#!/bin/bash -e

user=i830502
account=i830502trial
HCP_PASSWORD=Kiarita2018

json=$(tools/./neo.sh open-db-tunnel -i blaghanax -h hanatrial.ondemand.com -a  "$account" -u "$user" -p "$HCP_PASSWORD" --background --output json)
regex='.*"host":"([^"]*)".*"port":([^,]*),.*"instanceNumber":"([^"]*)".*"dbUser":"([^"]*)".*"dbUserPassword":"([^"]*)".*"sessionId":"([^"]*)".*'
[[ $json =~ $regex ]]
dbHost=${BASH_REMATCH[1]}
dbPort=${BASH_REMATCH[2]}
dbInstance=${BASH_REMATCH[3]}
dbUser=${BASH_REMATCH[4]}
dbPassword=${BASH_REMATCH[5]}
tunnelSessionId=${BASH_REMATCH[6]}

echo $dbPassword

Awesome…we’re almost there…please download this airports file and place it on your project folder…

Finally…with all files in place we can send everything back to heroku…so log into your Heroku tool belt and do the following…

Getting everything in place

git add .
git commit -am "message"
git push heroku master
heroku ps:scale web=1

R should be installed and ready to go -;) So we should move on and continue with our NodeJS server on Heroku…

Creating NodeJS on Heroku with PhantomJS

Create a folder called mynodeproject and inside create the following files…

package.json

{
"dependencies": {
"ws": "1.0.1",
"request": "2.67.0",
"express": "4.13.3",
"phantomjs-prebuilt": "2.1.3"
},
"engines": {
"node": "0.12.7"
}
}
Procfile

web: node index.js

index.js

var WebSocketServer = require("ws").Server
, http = require("http")
, express = require("express")
, request = require('request')
, fs = require('fs')
, app = express()
, arr = []
, msg = ""
, port = process.env.PORT || 5000
, childProcess = require('child_process')
, phantomjs = require('phantomjs-prebuilt')
, path = require('path')
, binPath = phantomjs.path;

app.use(express.static(__dirname + "/"))

var server = http.createServer(app)
server.listen(port)

var wss = new WebSocketServer({server: server})

var childArgs = [path.join(__dirname, 'phantom.js')]
var childStats = [path.join(__dirname, 'readphantom.js')]

app.get('/path', function (req, res) {
if(req.query.command == 'map'){
URL = "http://blagrookheroku.herokuapp.com/custom/summarize?airports=xyz&us_airports=&carriers=";
request(URL, function (error, response, body) {
if (!error) {
arr = body.split("/");
msg = "There are " + arr[1] + " airports around the world";
var bitmap = new Buffer(arr[0], 'hex');
var jpeg = new Buffer(bitmap,'base64');
fs.writeFileSync('Graph.jpg', jpeg);
res.redirect('/');
};
});
}else if(req.query.command == 'usmap'){
URL = "http://blagrookheroku.herokuapp.com/custom/summarize?airports=&us_airports=xyz&carriers=";
request(URL, function (error, response, body) {
if (!error) {
arr = body.split("/");
msg = "There are " + arr[1] + " airports in the US";
var bitmap = new Buffer(arr[0], 'hex');
var jpeg = new Buffer(bitmap,'base64');
fs.writeFileSync('Graph.jpg', jpeg);
res.redirect('/');
};
});
}else if(req.query.command == 'carriers'){
URL = "http://blagrookheroku.herokuapp.com/custom/summarize?airports=&us_airports=&carriers=xyz";
request(URL, function (error, response, body) {
if (!error) {
arr = body.split("/");
msg = "" + arr[1];
var bitmap = new Buffer(arr[0], 'hex');
var jpeg = new Buffer(bitmap,'base64');
fs.writeFileSync('Graph.jpg', jpeg);
res.redirect('/');
};
});
}else if(req.query.command == 'stat') {
childProcess.execFile(binPath, childArgs, function(err, stdout, stderr){
if(!err){
res.redirect('/');
};
});
}else if(req.query.command == 'readstat') {
childProcess.execFile(binPath, childStats, function(err, stdout, stderr){
if(!err){
res.write(stdout);
res.end();
};
});
}else if(req.query.command == 'bye'){
if(fs.existsSync('Graph.jpg')){
fs.unlink('Graph.jpg');
}
res.redirect('/');
}
});

wss.on("connection", function(ws) {
var id = setInterval(function() {

fs.readFile('Graph.jpg', function(err, data) {
if(!err){
ws.send(JSON.stringify("Graph.jpg/" + msg), function() { })
}else{
ws.send(JSON.stringify("Gandalf.jpg/No problem...I'm crunching your data..."), function() { })
}
});
}, 3000)

ws.on("close", function() {
clearInterval(id)
})
})

Let's explain the code for a little bit and believe me…I'm far from being a NodeJS expert…this is really the first time I develop something this complex…and it took me a really long time and tons of research…so please try not to criticize me too much -:( We're going to create an express application that uses Web Sockets to refresh the browser and show the graphics generated by our R Server…it will also call PhantomJS to both create and read the generated web page so we can send it back to Alexa…

Here…we have six choices…map, usmap, carriers, stat, readstat and bye…the first three are going to call our R Server passing all parameters but leaving the ones that we don't need empty…and just passing "xyz" as the parameter… When we get the response from R it's going to be a long string separated by a "/"…which is going to be the hexadecimal string for the graphic along with the text intended for Alexa…Node will read the graphic…generate it and then refresh the browser in order to show it on the screen… The stat option will call our PhantomJS script to simply read the page and create a new file with the Javascript part already executed…readstat will read this information and extract the text that we need for Alexa…finally…bye will delete the graphic and the web socket will cause the main graphic to be displayed on the screen. Finally…the web socket is going to constantly check…every 3 seconds…to see if there's a graphic or not…and then display the related image…

index.html

<html>
<head>
<title>I'm a NodeJS Page!</title>
<div id="container" align="center"/>
<script>
var host = location.origin.replace(/^http/, 'ws')
var ws = new WebSocket(host);
ws.onmessage = function (event) {
var container = document.getElementById('container');
var data = JSON.parse(event.data);
data = data.split("/");
var url = data[0];
var msg = data[1];

container.innerHTML = '<img src="' + url + '"></br><p><b>' + msg + '</b></p>';
};
</script>
</head>
<body>
</body>
</html>

This one is going to be called by our express application and it will simply call the web socket to determine what it needs to display…it has some Javascript…that’s why we need PhantomJS to interact with it…

phantom.js

var page = require('webpage').create();
var fs = require('fs');
page.open('http://blagnodeheroku.herokuapp.com/', function () {
window.setTimeout(function () {
page.evaluate(function(){

});

fs.write('stats.html', page.content, 'w');
phantom.exit();
},4000);
});

Not the best or most descriptive name…but who cares -:P Anyway…this script will load the page…that is, the express application…wait 4 seconds for the Javascript to run and then create a web page called stats.html


readphantom.js

var page = require('webpage').create(),
address = "stats.html";

page.open(address, function (status) {
if (status == 'success') {
var results = page.evaluate(function() {
return document.querySelector('p').innerText.trim();
});
console.log(results);
phantom.exit();
}
});

This script will simply read the stats.html page and return the text that is located inside the "p" tag…dead simple…

SETTING UP ALEXA

Creating the Lambda function

Now…we need to set up Alexa…so we can control everything via voice commands -:) First…we need to go to Amazon Lambda and log in if you have an account…otherwise…please create one…and make sure you're on the N. Virginia (US East) region…

In the list of functions…look for color… Choose the NodeJS one…Python has been included as well…but wasn't when I started to work on this blog -:) Here, just click next… I already created the function…but you shouldn't have a problem… Basic execution role is more than enough… This will provide a pop-up window…simply press the "Allow" button and then "Create function"…we will include the source code later on…but note the generated ARN number…because we're going to need it in the next step…

Creating the Skill

Go to http://developer.amazon.com and log in…then choose Apps & Services –> Alexa –> Alexa Skills Kit. Choose Alexa Skills Kit and fill in the blanks… As soon as we hit next, an application number will be generated in a new field called Application ID. Grab this number as we're going to need it for our application code. The Interaction Model section is very important as here we're going to define the "Intent Schema" and "Sample Utterances"…the first will define the parameters that we're going to send to Alexa and the second is how we are going to call our application.

Intent Schema

{
"intents": [
{
"intent": "GetFlightsIntent",
"slots": [
{
"name": "command",
"type": "LITERAL"
}
]
},
{
"intent": "HelpIntent",
"slots": []
}
]
}

Our variable is going to be called "command" and it's going to be a LITERAL…other types are NUMBER, DATE, TIME and DURATION. The intent is the method that we're going to call in our code…

Sample Utterances

GetFlightsIntent airports from {around the world|command}
GetFlightsIntent airports from {united states|command}
GetFlightsIntent flight distance from {carriers|command}
GetFlightsIntent {thank you|command}

The test section helps us to say commands and see how Alexa responds…but we're not going to do that here…we're going to test it using a real Alexa device -;)

Forget about the Publishing Information section unless you really want to publish your application…
Create a folder called Flights…Alexa_Party…or whatever you fancy…then create a folder called src and copy this file in there…calling it AlexaSkill.js

We’re going to need to install only one library….”request”…

sudo npm install --prefix=~/Flights/src request

This will create a folder called “node_modules” with the package in our project folder…then create a file called “index.js” and copy and paste the following code…

index.js

var request = require("request")
, AlexaSkill = require('./AlexaSkill')
, APP_ID = 'amzn1.echo-sdk-ams.app.8c0bd993-723f-4ab2-80b5-84402a7a59ce';

var error = function (err, response, body) {
console.log('ERROR [%s]', err);
};

var getJsonFromFlights = function(command, callback){
var msg = "";
if(command == "thank you"){
request("http://blagnodeheroku.herokuapp.com/path/?command=bye", function (error, response, body) {
if (!error) {
console.log("Done");
};
});
setTimeout(function() {
callback("thank you");
},2000);
}else if (command == "around the world"){
request("http://blagnodeheroku.herokuapp.com/path/?command=bye", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=map", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=stat", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=readstat", function (error, response, body) {
if (!error) {
msg = body;
};
});
};
});
};
});
}
});
setTimeout(function() {
callback(msg.trim());
},15000);
}else if (command == "united states"){
request("http://blagnodeheroku.herokuapp.com/path/?command=bye", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=usmap", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=stat", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=readstat", function (error, response, body) {
if (!error) {
msg = body;
};
});
};
});
};
});
}
});
setTimeout(function() {
callback(msg.trim());
},15000);
}else if (command == "carriers"){
request("http://blagnodeheroku.herokuapp.com/path/?command=bye", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=carriers", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=stat", function (error, response, body) {
if (!error) {
request("http://blagnodeheroku.herokuapp.com/path/?command=readstat", function (error, response, body) {
if (!error) {
msg = body;
};
});
};
});
};
});
}
});
setTimeout(function() {
callback(msg.trim());
},15000);
}
};
var handleFlightsRequest = function(intent, session, response){
getJsonFromFlights(intent.slots.command.value, function(data){
if(data != "thank you"){
var text = data;
var reprompt = 'Please say a command?';
response.ask(text, reprompt);
}else{
response.tell("You're welcome");
}
});
};

var Flights = function(){
AlexaSkill.call(this, APP_ID);
};

Flights.prototype = Object.create(AlexaSkill.prototype);
Flights.prototype.constructor = Flights;

Flights.prototype.eventHandlers.onSessionStarted = function(sessionStartedRequest, session){
console.log("onSessionStarted requestId: " + sessionStartedRequest.requestId
+ ", sessionId: " + session.sessionId);
};

Flights.prototype.eventHandlers.onLaunch = function(launchRequest, session, response){
// This is when they launch the skill but don't specify what they want.
var output = 'Welcome to Flights. ' +
'Please, say a command.';

var reprompt = 'Please, say a command?';

response.ask(output, reprompt);

console.log("onLaunch requestId: " + launchRequest.requestId
+ ", sessionId: " + session.sessionId);
};

Flights.prototype.intentHandlers = {
GetFlightsIntent: function(intent, session, response){
handleFlightsRequest(intent, session, response);
},

HelpIntent: function(intent, session, response){
var speechOutput = 'Get the information for airports and flights. ' +
'Please say a command?';
response.ask(speechOutput);
}
};

exports.handler = function(event, context) {
var skill = new Flights();
skill.execute(event, context);
};

Time to explain what I was trying to do here -:P

The handleFlightsRequest method will manage the response that Alexa will spell out for us…and inside this method we can find getJsonFromFlights, which will take the command defined in our Intent Schema. This function will call our NodeJS server for the following commands…"thank you" will simply call the bye command…"around the world" will call the bye, map, stat and readstat commands…"united states" will call the bye, usmap, stat and readstat commands…finally "carriers" will call the bye, carriers, stat and readstat commands…

After 15 seconds (Yep…I know it’s too much but there are a lot of processes going on) Alexa will get the response message and simply speak it to us -;)

That’s pretty much it…now…I can show some images before we jump into the video…

Ok…enough…let’s watch this in real live action

Greetings,

Blag.
Development Culture.

To leave a comment for the author, please follow the link and comment on their blog: Blag's bag of rants.


Hadley Wickham’s Advanced R in Amsterdam

Sat, 2016-02-06 14:24

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

On May 19 and 20, 2016, Hadley Wickham will teach his two day Master R Developer Workshop in the centrally located European city of Amsterdam.

Are you ready to upgrade your R skills?  Register soon to secure your seat.

For the convenience of those who may travel to the workshop, it will be held at the Hotel NH Amsterdam Schiphol Airport.

Hadley teaches a few workshops each year and this is the only one planned for Europe. They are very popular and hotel rooms are limited. Please register soon.

We look forward to seeing you in the month of May!

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.


Mastering R plot – Part 2: Axis

Sat, 2016-02-06 11:42

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

This is the second part of the Mastering R plot series.

The standard plot function in R allows extensive tuning of every element being plotted. There are, however, many possible ways to do this, and the standard help files are hard to grasp at the beginning. In this article we will see how to control every aspect of the axis (labels, tick marks …) in the standard plot function.

Axis title and labels

Create some data and create a plot with default settings.

#some data
x <- 1:100
y <- runif(100, -2, 2)
#a usual plot with per default settings
plot(x, y)
#changing the axis title is pretty straightforward
plot(x, y, xlab="Index", ylab="Uniform draws")

Here is the plot:

The settings of the plot are usually controlled by the par function (see ?par for the many possible arguments), once the arguments are set in par they apply to all subsequent plots. Some arguments in par (for example cex.axis) can also be set in other plot functions like axis or text. When these arguments are set in these other functions they will then apply only to the current plot. One can then control if he/she wants all plots to be affected by the change or only the current one.

#change the sizes of the axis labels and axis title
op <- par(no.readonly=TRUE) #this is done to save the default settings
par(cex.lab=1.5, cex.axis=1.3)
plot(x, y, xlab="Index", ylab="Uniform draws")
#if we want big axis titles and labels we need to set more space for them
par(mar=c(6,6,3,3), cex.axis=1.5, cex.lab=2)
plot(x, y, xlab="Index", ylab="Uniform draws")

Here is the plot:

A handy function for gaining deeper control over the axis is the axis function, which can control, among other things, at which values the tick marks are drawn, which labels to put under the tick marks, the line type and width of the axis line, the width of the tick marks, and the color of the tick marks and axis line.

#we can further control the axis using the axis function
par(op) #re-set the plot to the default settings
plot(x, y, xaxt="n") #this prevents the plotting of x-axis labels
axis(side=1, at=c(5,50,100)) #force the tick marks to be drawn at x=5, 50 and 100
#one can also specify the labels at the tick marks
plot(x, y, yaxt="n")
axis(side=2, at=c(-2,0,2), labels=c("Small","Medium","Big"))
#the axis function also controls the axis line and tick marks
plot(x, y)
axis(side=3, at=c(5,25,75), lwd=4, lwd.ticks=2, col.ticks="red")
#sometimes you may want to remove the box around the plot and only show the axis lines
plot(x, y, bty="n", xaxt="n", yaxt="n")
axis(side=1, at=seq(0,100,20), lwd=3)
axis(side=2, at=seq(-2,2,2), lwd=3)

Here is the plot:

Also note that an R plot has four sides, numbered starting at the bottom and going clockwise (i.e. side=3 corresponds to the top of the graph).

Tick marks

Let’s turn now to the tick marks, they can also be controlled either from par or from axis.

#finer control of the tick marks goes through par or axis
par(tcl=0.4, mgp=c(1.5,0,0))
#tcl controls the length of the tick marks
#positive values will make the ticks be drawn inside the plot
#negative values will make the ticks go outside
#mgp takes three values: the first one controls how many lines between plot and axis title,
#the second between plot and axis labels and the third between plot and axis line
plot(x, y)
#another example using axis
par(op)
plot(x, y, xaxt="n", yaxt="n", xlab="", ylab="")
axis(side=1, at=seq(5,95,30), tcl=0.4, lwd.ticks=3, mgp=c(0,0.5,0))
mtext(side=1, text="X axis", line=1.5) #cannot set the axis title with axis so need to use mtext
axis(side=2, at=seq(-2,2,2), tcl=0.3, lwd.ticks=3, col.ticks="orange", mgp=c(0,0,2))
mtext(side=2, text="Numbers taken randomly", line=2.2)

Here is the plot:

Here we saw a third additional function, mtext, which allows one to write text outside of the plotting area (I guess that mtext stands for "margin text"). We have to specify how far from the plotting region we want to write the text with the line argument. This concept of lines is important for understanding how to control the space around the plotting region. We controlled it earlier in this post when I used the mar argument, which sets how many lines are available on each side of the plot. Let's look at this in more detail:

#understanding the lines
plot(1:10, 1:10, xlab="", ylab="", xaxt="n", yaxt="n")
for(i in 0:4){
  mtext(side=1, text=paste0("Line ", i), line=i)
}
for(i in 0:3){
  mtext(side=2, text=paste0("Line ", i), line=i)
}
#of course this can be changed with the mar argument in par
par(mar=c(7,2,2,2))
plot(1:10, 1:10, xlab="", ylab="", xaxt="n", yaxt="n")
for(i in 0:6){
  mtext(side=1, text=paste0("Line ", i), line=i)
}
for(i in 0:1){
  mtext(side=2, text=paste0("Line ", i), line=i)
}

Here is the plot:

From this last graph it is easy to grasp that on each side of the plot there is a certain number of lines; have a look at par()$mar for the default numbers. Using this, one can control how much space to create around the plot but also where to write axis labels or titles. Next time we'll extend this concept of margins by talking about the outer margins of the plot; until then, happy plotting!
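As a quick reference (this check is mine, not part of the original post), printing par()$mar in a fresh R session shows the default of 5.1, 4.1, 4.1 and 2.1 lines for the bottom, left, top and right sides respectively:

par()$mar
## [1] 5.1 4.1 4.1 2.1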

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.


What has Kaggle learned from 2 million machine learning models?

Fri, 2016-02-05 19:36

(This article was first published on R – Data Science Los Angeles, and kindly contributed to R-bloggers)

What has Kaggle learned from 2 million machine learning models?

Anthony Goldbloom, founder and CEO of Kaggle

UPDATE: Video available here

Kaggle is a community of almost 450K data scientists who have built almost 2MM machine learning models to participate in our competitions. Data scientists come to Kaggle to learn, collaborate and develop the state of the art in machine learning. This talk will cover some of the lessons on winning techniques we have learned from the Kaggle community.

Speaker bio: Anthony Goldbloom is the founder and CEO of Kaggle. In 2011 & 2012, Forbes Magazine named Anthony one of the 30 under 30 in technology, in 2013 the MIT Tech Review named him one of the top 35 innovators under the age of 35, and the University of Melbourne awarded him an Alumni of Distinction Award. He holds a first class honours degree in Econometrics from the University of Melbourne. Anthony has published in The Economist and the Harvard Business Review.

Date: February 4, 2016

Timeline:
– 6:00pm arrival, food/drinks and networking
– 6:30pm talk starts

You must have a confirmed RSVP and please arrive by 6:25pm the latest. Please RSVP here on Eventbrite.

Venue: Hulu, 2500 Broadway 2nd Floor, Santa Monica, CA 90404

Thanks Hulu for hosting!

Parking: There is some limited street parking around. There is a parking garage underneath, but tickets will not be validated.

 

To leave a comment for the author, please follow the link and comment on their blog: R – Data Science Los Angeles.


Pitfall of XML package: to know the cause

Fri, 2016-02-05 19:26

(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

This is the sequel to the previous report "issues specific to cp932 locale, Japanese Shift-JIS, on Windows". In this report, I will dig deeper into the issues to find out exactly what is happening.

1. Where it occurs

I knew the issues depend on the node texts, because the very same script ran fine on Windows when parsing another table in the same html source.

# Windows
src <- 'http://www.taiki.pref.ibaraki.jp/data.asp'

t2 <- iconv(as.character(
  readHTMLTable(src, which=4, trim=T, header=F, skip.rows=2:48, encoding='shift-jis')[1,1]
), from='utf-8', to='shift-jis')

> t2 # bad
[1] NA

s2 <- iconv(as.character(
  readHTMLTable(src, which=6, trim=T, header=F, skip.rows=1, encoding='shift-jis')[2,2]
), from='utf-8', to='shift-jis')

> s2 # good
[1] "北茨城中郷"

To know the difference of the two html parts is to know where the issue occurs.

Let’s see the html source by primitive functions, instead of using the package XML.

con <- url(src, encoding='shift-jis')
x <- readLines(con)
close(con)

I know two useful keywords to locate points of t2 and s2 above; 2016 and 101 respectively.

# for t2
> grep('2016', x)
[1] 120 133 141 148 160 161
> x[119:121]
[1] "\t\t\t<td class=\"title\">"
[2] "\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)"
[3] "\t\t\t</td>"

# for s2
> grep('101', x)
[1] 181
> x[181:182]
[1] "\t\t\t\t\t\t<td class=\"td1\">101</td>"
[2] "\t\t\t\t\t\t<td class=\"td2\">北茨城中郷</td>"

Note that only x[182] is for s2 and x[181] was used to find the position.  Apparently, differences between t2 and s2 are:

  1. t2 includes &nbsp;.
  2. t2 spreads over multiple lines.

Because I want to know the exact content of a single node (html element), the three lines of t2 must be joined together.

paste(x[119:121], collapse='\r\n')

Pasting with the newline code \r\n may be the answer, but a more exact procedure is better.

Binary functions are elegant tools which can handle an html source as binary data, exactly as it is offered by the web server, regardless of client platforms and locales.

con <- url(src, open='rb')
skip <- readBin(con, what='raw', n=5009)
xt2 <- readBin(con, what='raw', n=92)
skip <- readBin(con, what='raw', n=3013)
xs2 <- readBin(con, what='raw', n=31)
close(con)

This time I cheated: I took the byte positions of interest from the prior result x.

# t2 begins after
> sum(nchar(x[1:118], type='bytes') + 2) + nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 5009

# t2 length
> sum(nchar(x[119:121], type='bytes') + 2) - 2 - nchar(sub('<.*$', '', x[119]), type='bytes')
[1] 92

# s2 begins after
> sum(nchar(x[1:181], type='bytes') + 2) + nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 8114

# s2 length
> nchar(x[182], type='bytes') - nchar(sub('<.*$', '', x[182]), type='bytes')
[1] 31

# s2 from the end of t2
> 8114 - 5009 - 92
[1] 3013

Variables xt2 and xs2 have what I want as binary (raw) vector.

# Windows
> rawToChar(xt2)
[1] "<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)\r\n\t\t\t</td>"
> rawToChar(xs2)
[1] "<td class=\"td2\">北茨城中郷</td>"

Compare inside texts of these nodes.

t2: \r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)\r\n\t\t\t
s2: 北茨城中郷

Maybe a text including control codes (\r, \n, \t) and/or HTML entities (&nbsp;) is unsafe, while a text made up of printable Kanji characters only is safe. So, the issues must occur at these special characters.

Before digging further, I want to introduce two nice binary functions that can be used without cheating on the byte positions. Function getBinaryURL in package RCurl can get the whole binary data from a web page. Function grepRaw can locate the positions of a specific character sequence in a binary vector.

library(RCurl)
x <- getBinaryURL(src)
> str(x)
 raw [1:35470] 0d 0a 3c 68 ...
> grepRaw('2016', x, all=FALSE)
[1] 5062
> x[5000:5100]
  [1] 09 3c 74 72 3e 0d 0a 09 09 09 3c 74 64 20 63 6c 61 73 73 3d 22 74 69 74
 [25] 6c 65 22 3e 0d 0a 09 09 09 09 8d c5 90 56 82 cc 8a cf 91 aa 8f ee 95 f1
 [49] 26 6e 62 73 70 3b 26 6e 62 73 70 3b 81 69 32 30 31 36 94 4e 31 8c 8e 31
 [73] 37 93 fa 26 6e 62 73 70 3b 26 6e 62 73 70 3b 38 8e 9e 81 6a 0d 0a 09 09
 [97] 09 3c 2f 74 64
> rawToChar(x[5000:5100])
[1] "\t<tr>\r\n\t\t\t<td class=\"title\">\r\n\t\t\t\t最新の観測情報&nbsp;&nbsp;(2016年1月17日&nbsp;&nbsp;8時)\r\n\t\t\t</td"

2. What is the cause

Let’s check out what happens when an HTML document contains an HTML entity (&nbsp;) or space characters (\r, \n, \t). I’m going to use a minimal HTML snippet to compare the responses of package XML on Mac and Windows.

library(XML)
> xmlValue(xmlRoot(htmlParse('<html>ABC</html>', asText=T)))
[1] "ABC"

2-1. No-Break Space (U+00A0, &nbsp;)

# Mac
> xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T)))
[1] " "  # good
> iconv(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA  # bad
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))))
[1] c2 a0  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;あ</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good

# Windows
> xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T)))
[1] "ツ\xa0"  # nonsense; putting utf-8 characters on shift-jis terminal
> iconv(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))), from='utf-8', to='shift-jis')
[1] NA  # bad
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;</html>', asText=T))))
[1] c2 a0  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;あ</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\xe3\x81\x82</html>', asText=T, encoding='utf-8'))))
[1] c2 a0 e3 81 82  # good
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>&nbsp;\x82\xa0</html>', asText=T, encoding='shift-jis'))))
[1] c2 a0 e3 81 82  # good

As shown above, function xmlValue always returns a UTF-8 string and the result is exactly the same on both Mac and Windows, regardless of the difference in locales. An &nbsp; is converted to U+00A0 (\xc2\xa0 in UTF-8). An error occurs when iconv converts that UTF-8 character into Shift-JIS, on both Mac and Windows. So, this is not an issue of xmlValue.

The issue can be simplified into an issue of iconv.

# Mac and Windows
> iconv('\u00a0', from='utf-8', to='shift-jis', sub='byte')
[1] "<c2><a0>"  # bad

As shown above, function iconv fails to convert U+00A0 into shift-jis. Because Mac users usually do not convert characters into shift-jis, the issue is specific to Windows.

Perhaps I have found the background of the cause. According to a list of JIS X 0213 non-Kanji on Wikipedia, No-Break Space was not defined in JIS X 0208 and was added in JIS X 0213 in 2004. This means U+00A0 is included in the latest extended Shift-JIS (shift_jis-2004), but not in the conventional Shift-JIS. Because Windows code page 932 (cp932) was defined after the conventional Shift-JIS (JIS X 0208), cp932 is not compatible with JIS X 0213. In contrast, Mac uses shift_jis-2004 (JIS X 0213).

# Mac
> charToRaw(iconv('\u00a0', from='utf-8', to='shift_jisx0213', sub=' '))
[1] 85 41  # good

When the explicit version of Shift-JIS is specified, iconv successfully converts U+00A0 into shift_jis-2004.

But, Windows fails with the message:

unsupported conversion from 'utf-8' to 'shift_jisx0213' in codepage 932.

Actually, the issue is not with iconv itself, but with differences between versions of the JIS standard.
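As a quick check (an addition of mine, not part of the original report), you can list which encoding names the local iconv build understands; if no Shift JIS 2004 variant appears, the conversion above cannot succeed on that platform:

# list the Shift-JIS-related encodings the local iconv supports;
# shift_jisx0213 is what the Mac conversion above relied on
grep("jis", iconvlist(), ignore.case = TRUE, value = TRUE)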

2-2. trim

In the following tests, a Japanese Hiragana character “あ”, whose binary code is “e3 81 82” in UTF-8 and “82 a0” in Shift-JIS, was used.

# Mac
> xmlValue(xmlRoot(htmlParse('<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii
> xmlValue(xmlRoot(htmlParse('<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>あ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82  # good. shift-jis, trim=FALSE
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>a\tあ</html>', from='utf-8', to='shift-jis'), asText=T, encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82  # good. shift-jis, trim=TRUE but is not required
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim

# Windows
> xmlValue(xmlRoot(htmlParse('<html>a</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii
> xmlValue(xmlRoot(htmlParse('<html>\ta</html>', asText=T, encoding='shift-jis')), trim=T)
[1] "a"  # good. ascii, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>あ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e3 81 82  # good. shift-jis
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=F))
[1] 09 e3 81 82  # good. shift-jis, trim=FALSE
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\x82\xa0</html>', asText=T, encoding='shift-jis')), trim=T))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad. shift-jis, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>a\tあ</html>', asText=T, encoding='shift-jis')), trim=T))
[1] 61 09 e3 81 82  # good. shift-jis, trim=TRUE but is not required
> charToRaw(xmlValue(xmlRoot(htmlParse(iconv('<html>\tあ</html>', from='shift-jis', to='utf-8'), asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim
> charToRaw(xmlValue(xmlRoot(htmlParse('<html>\t\xe3\x81\x82</html>', asText=T, encoding='utf-8')), trim=T))
[1] e3 81 82  # good. utf-8, trim

Mac passed all tests.

In contrast, a bad case was found on Windows: when the text consisted of Japanese characters (あ) and space characters (\t), when the option trim=TRUE was specified, and when removal of space characters was truly required. The result was something unreadable. ASCII and UTF-8 encodings were safe.

A point here may be the difference in regular expression behaviour between locales: UTF-8 for Mac and Shift-JIS (cp932) for Windows.

# Mac
> charToRaw(gsub('\\s', '', '\tあ'))
[1] e3 81 82  # good

# Windows
charToRaw(gsub('\\s', '', iconv('\tあ', from='shift-jis', to='utf-8')))
[1] e7 b8 ba e3 a0 bc e3 b8 b2  # bad

This result matches exactly with the tests of xmlValue. So, what trim=TRUE in the package XML does may be to apply R’s regular expressions (which depend on the locale) to the internal string (which is always UTF-8). Because a regular expression working on Japanese Windows is safe for the national locale (cp932), it is unsafe for the international locale (UTF-8).

Additionally, the result of the UTF-8 trim test was good on Windows. This indicates there are some procedures to handle locale differences in the package XML, and this bad case slips through by mistake.

3. Another way

Thanks to Jeroen for the comment. The CRAN package xml2 is free from the second issue.

# Windows
> charToRaw(xml_text(read_xml('<html>\tあ</html>', encoding='shift-jis'), trim=T))
[1] e3 81 82  # good. shift-jis, trim=TRUE

Its trim=TRUE works well on Windows with Shift-JIS encoded HTML, and the result is independent of the read_xml option as_html=TRUE or FALSE. So we can use the package xml2 as an alternative solution to the package XML.
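For completeness, here is a hedged sketch (not from the original report) of re-reading the same table with xml2 plus rvest instead of XML::readHTMLTable; the table index is assumed to correspond to which=6 above:

# sketch only: xml2/rvest equivalent of the readHTMLTable() call from section 1
library(xml2)
library(rvest)
doc  <- read_html(src, encoding = 'shift-jis')
tbls <- html_nodes(doc, "table")
s2b  <- html_table(tbls[[6]], fill = TRUE)[2, 2]   # assumed to match readHTMLTable(..., which=6)[2,2]
s2b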

To leave a comment for the author, please follow the link and comment on their blog: R – ЯтомизоnoR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Speaking at DataPhilly February 2016

Fri, 2016-02-05 13:29

(This article was first published on Rstats – bayesianbiologist, and kindly contributed to R-bloggers)

The next DataPhilly meetup will feature a medley of machine-learning talks, including an Intro to ML from yours truly. Check out the speakers list and be sure to RSVP. Hope to see you there!

Thursday, February 18, 2016

6:00 PM to 9:00 PM

Speakers:

  • Corey Chivers
  • Randy Olson
  • Austin Rochford

Corey Chivers (Penn Medicine)

Abstract: Corey will present a brief introduction to machine learning. In his talk he will demystify what is often seen as a dark art. Corey will describe how we “teach” machines to learn patterns from examples by breaking the process into its easy-to-understand component parts. By using examples from fields as diverse as biology, health-care, astrophysics, and NBA basketball, Corey will show how data (both big and small) is used to teach machines to predict the future so we can make better decisions.

Bio: Corey Chivers is a Senior Data Scientist at Penn Medicine where he is building machine learning systems to improve patient outcomes by providing real-time predictive applications that empower clinicians to identify at-risk individuals. When he’s not poring over data, he’s likely to be found cycling around his adoptive city of Philadelphia or blogging about all things probability and data at bayesianbiologist.com.

Randy Olson (University of Pennsylvania Institute for Biomedical Informatics):

Automating data science through tree-based pipeline optimization

Abstract: Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in business, academia, and government. In this talk, I’m going to introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning — pipeline design. All of the work presented in this talk is based on the open source Tree-based Pipeline Optimization Tool (TPOT), which is available on GitHub at https://github.com/rhiever/tpot.

Bio: Randy Olson is an artificial intelligence researcher at the University of Pennsylvania Institute for Biomedical Informatics, where he develops state-of-the-art machine learning algorithms to solve biomedical problems. He regularly writes about his latest adventures in data science at RandalOlson.com/blog, and tweets about the latest data science news at http://twitter.com/randal_olson.

Austin Rochford (Monetate):

Abstract: Bayesian optimization is a technique for finding the extrema of functions which are expensive, difficult, or time-consuming to evaluate. It has many applications to optimizing the hyperparameters of machine learning models, optimizing the inputs to real-world experiments and processes, etc. This talk will introduce the Gaussian process approach to Bayesian optimization, with sample code in Python.

Bio: Austin Rochford is a Data Scientist at Monetate. He is a former mathematician who is interested in Bayesian nonparametrics, multilevel models, probabilistic programming, and efficient Bayesian computation.

To leave a comment for the author, please follow the link and comment on their blog: Rstats – bayesianbiologist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Introducing Microsoft R Open: Replay and slides

Fri, 2016-02-05 12:32

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

We had a fantastic turnout to last week's webinar, Introduction to Microsoft R Open. If you missed it, you can watch the replay below. In the talk, I give some background on the R language and its applications, describe the performance and reproducibility benefits of Microsoft R Open, and give a demonstration of the basics of the R language along with a more in-depth demo of producing a beautiful weather data chart with R.

 

 

You can also check out the slides (and follow the embedded links) below:

Check out the other webinars in the Microsoft R Series as well.

Microsoft Webinars: Introduction to Microsoft R Open

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Shiny Developer Conference 2016 Recap

Fri, 2016-02-05 12:17

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

This is a guest post from VP Nagraj, a data scientist embedded within UVA’s Health Sciences Library, who runs our Data Analysis Support Hub (DASH) service. Last weekend I was fortunate enough to be able to participate in the first ever Shiny Developer Conference hosted by RStudio at Stanford University. I’ve built a handful of apps, and have taught an introductory workshop on Shiny. In spite of that, almost all of the presentations de-mystified at least one aspect of the how, why or so what of the framework. Here’s a recap of what resonated with me, as well as some code and links out to my attempts to put what I digested into practice.

tl;dr
  • reactivity is a beast
  • javascript isn’t cheating
  • there are already a ton of shiny features … and more on the way
reactivity

For me, understanding reactivity has been one of the biggest challenges to using Shiny … or at least to using Shiny well. But after > 3 hours of an extensive (and really clear) presentation by Joe Cheng, I think I’m finally starting to see what I’ve been missing. Here’s something in particular that stuck out to me:
output$plot = renderPlot() is not an imperative telling the browser what to do … it’s a recipe for how the browser should do something.
Shiny ‘render’ functions (e.g. renderPlot(), renderText(), etc) inherently depend on reactivity. What the point above emphasizes is that assignments to a reactive expression are not the same as assignments made in “regular” R programming. Reactive outputs depend on inputs, and subsequently change as those inputs are manipulated. If you want to watch how those changes happen in your own app, try adding options(shiny.reactlog=TRUE) to the top of your server script. When you run the app in a browser and press COMMAND + F3 (or CTRL + F3 on Windows) you’ll see a force directed network that outlines the connections between inputs and outputs. Another way to implement reactivity is with the reactive() function.
For my apps, one of the pitfalls has been re-running the same code multiple times. That’s a perfect use-case for reactivity outside of the render functions. Here’s a trivial example:

library(shiny)

ui = fluidPage(
numericInput("threshold", "mpg threshold", value = 20),
plotOutput("size"),
textOutput("names")
)

server = function(input, output) {

output$size = renderPlot({

dat = subset(mtcars, mpg > input$threshold)
hist(dat$wt)

})

output$names = renderText({

dat = subset(mtcars, mpg > input$threshold)
rownames(dat)

})
}

shinyApp(ui = ui, server = server)
The code above works … but it’s redundant. There’s no need to calculate the “dat” object separately in each render function. The code below does the same thing but stores “dat” in a reactive that is only calculated once. library(shiny)

ui = fluidPage(
numericInput("threshold", "mpg threshold", value = 20),
plotOutput("size"),
textOutput("names")
)

server = function(input, output) {

dat = reactive({

subset(mtcars, mpg > input$threshold)

})

output$size = renderPlot({

hist(dat()$wt)

})

output$names = renderText({

rownames(dat())

})
}

shinyApp(ui = ui, server = server)
javascript

For whatever reason I’ve been stuck on the idea that using JavaScript inside a Shiny app would be “cheating”. But Shiny is actually well equipped for extensions with JavaScript libraries. Several of the speakers leaned in on this idea. Yihui Xie presented on the DT package, which is an interface to use features like client-side filtering from the DataTables library. And Dean Attali demonstrated shinyjs, a package that makes it really easy to incorporate JavaScript operations. Below is code for a masterpiece that does some hide() and show():

# https://apps.bioconnector.virginia.edu/game
library(shiny)
library(shinyjs)
shinyApp(

ui = fluidPage(
titlePanel(actionButton("start", "start the game")),
useShinyjs(),
hidden(actionButton("restart", "restart the game")),
tags$h3(hidden(textOutput("game_over")))
),

server = function(input, output) {

output$game_over =
renderText({
"game over, man ... game over"
})

observeEvent(input$start, {

show("game_over", anim = TRUE, animType = "fade")
hide("start")
show("restart")
})

observeEvent(input$restart, {
hide("game_over")
hide("restart")
show("start")
})

}
)
everything else

brushing

http://shiny.rstudio.com/articles/plot-interaction.html

Adding a brush argument to plotOutput() lets you click and drag to select points on a plot. You can use this for “zooming in” on something like a time series plot. Here’s the code for an app I wrote based on data from the babynames package – in this case the brush lets you zoom to see name frequency over a specific range of years.

# http://apps.bioconnector.virginia.edu/names/
library(shiny)
library(ggplot2)
library(ggthemes)
library(babynames)
library(scales)

options(scipen=999)

ui = fluidPage(titlePanel(title = "names (1880-2012)"),
textInput("name", "enter a name"),
actionButton("go", "search"),
plotOutput("plot1", brush = "plot_brush"),
plotOutput("plot2"),
htmlOutput("info")

)

server = function(input, output) {

dat = eventReactive(input$go, {

subset(babynames, tolower(name) == tolower(input$name))

})

output$plot1 = renderPlot({

ggplot(dat(), aes(year, prop, col=sex)) +
geom_line() +
xlim(1880,2012) +
theme_minimal() +
# format labels with percent function from scales package
scale_y_continuous(labels = percent) +
labs(list(title ="% of individuals born with name by year and gender",
x = "n click-and-drag over the plot to 'zoom'",
y = ""))

})

output$plot2 = renderPlot({

# need latest version of shiny to use req() function
req(input$plot_brush)
brushed = brushedPoints(dat(), input$plot_brush)

ggplot(brushed, aes(year, prop, col=sex)) +
geom_line() +
theme_minimal() +
# format labels with percent function from scales package
scale_y_continuous(labels = percent) +
labs(list(title ="% of individuals born with name by year and gender",
x = "",
y = ""))

})

output$info = renderText({

"data source: social security administration names from babynames package

"

})

}

shinyApp(ui, server)
gadgets

http://shiny.rstudio.com/articles/gadgets.html

A relatively easy way to leverage Shiny reactivity for visual inspection and interaction with data within RStudio. The main difference here is that you’re using an abbreviated (or ‘mini’) ui. The advantage of this workflow is that you can include it in your script to make your analysis interactive. I modified the example in the documentation and wrote a basic brushing gadget that removes outliers:

library(shiny)
library(miniUI)
library(ggplot2)

outlier_rm = function(data, xvar, yvar) {

ui = miniPage(
gadgetTitleBar("Drag to select points"),
miniContentPanel(
# The brush="brush" argument means we can listen for
# brush events on the plot using input$brush.
plotOutput("plot", height = "100%", brush = "brush")
)
)

server = function(input, output, session) {

# Render the plot
output$plot = renderPlot({
# Plot the data with x/y vars indicated by the caller.
ggplot(data, aes_string(xvar, yvar)) + geom_point()
})

# Handle the Done button being pressed.
observeEvent(input$done, {

# create id for data
data$id = 1:nrow(data)

# Return the brushed points. See ?shiny::brushedPoints.
p = brushedPoints(data, input$brush)

# create vector of ids that match brushed points and data
g = which(p$id %in% data$id)

# return a subset of the original data without brushed points
stopApp(data[-g,])
})
}

runGadget(ui, server)
}

# run to open plot viewer
# click and drag to brush
# press done return a subset of the original data without brushed points
library(gapminder)
outlier_rm(gapminder, "lifeExp", "gdpPercap")

# you can also use the same method above but pass the output into a dplyr pipe syntax
# without the selection what is the mean life expectancy by country?
library(dplyr)
outlier_rm(gapminder, "lifeExp", "gdpPercap") %>%
group_by(country) %>%
summarise(mean(lifeExp))
req()

http://shiny.rstudio.com/reference/shiny/latest/req.html

This solves the issue of requiring an input – I’m definitely going to use this so I don’t have to do the return(NULL) workaround:

# no need to do this any more
#
# inFile = input$file1
#
# if (is.null(inFile))
# return(NULL)

# use req() instead
req(input$file1)
profvis

http://rpubs.com/wch/123888

Super helpful method for digging into the call stack of your R code to see how you might optimize it. One or two seconds of processing can make a big difference, particularly for a Shiny app …

rstudio connect

https://www.rstudio.com/rstudio-connect-beta/

Jeff Allen from RStudio gave a talk on deployment options for Shiny applications and mentioned this product, which is a “coming soon” platform for hosting apps alongside RMarkdown documents and plots. It’s not available as a full release yet, but there is a beta version for testing.

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Cricket analytics with cricketr in paperback and Kindle versions

Fri, 2016-02-05 10:59

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

My book “Cricket analytics with cricketr” is now available in paperback and Kindle versions. The paperback is available from Amazon (US, UK and Europe) for $ 48.99. The Kindle version can be downloaded from the Kindle store for $2.50 (Rs 169/-). Do pick your copy. It should be a good read for a Sunday afternoon.

This book of mine contains my posts based on my R package ‘cricketr’ now in CRAN. The package cricketr can analyze both batsmen and bowlers for all formats of the game Test, ODI and Twenty20. The package uses the data from ESPN Cricinfo. The analyses include runs frequency charts, performances of batsmen and bowlers in different grounds and against different teams, moving  average of  runs/wickets over the career, mean strike rate, mean economy rate and so on.

The book includes the following chapters based on my R package cricketr. There are 2 additional articles where I use Machine Learning with Octave.

CONTENTS
Cricket Analytics with cricketr 11
1.1. Introducing cricketr! : An R package to analyze performances of cricketers 11
1.2. Taking cricketr for a spin – Part 1 49
1.2. cricketr digs the Ashes! 70
1.3. cricketr plays the ODIs! 99
1.4. cricketr adapts to the Twenty20 International! 141
1.5. Sixer – R package cricketr’s new Shiny avatar 170
2. Other cricket posts in R 180
2.1. Analyzing cricket’s batting legends – Through the mirage with R 180
2.2. Mirror, mirror … the best batsman of them all? 206
3. Appendix 220
Cricket analysis with Machine Learning using Octave 220
3.1. Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid 221
3.2. Informed choices through Machine Learning-2 Pitting together Kumble, Kapil, Chandra 234
Further reading 248
Important Links 249

I do hope you have a great time reading it. Do pick up your copy. Feel free to get in touch with me with your comments and suggestions.  I have more interesting things lined up for the future.

Watch this space!

You may also like
1. Literacy in India : A deepR dive.
2. Natural Language Processing: What would Shakespeare say?
3. Revisiting crimes against women in India
4. Experiments with deblurring using OpenCV
5. TWS-4: Gossip protocol: Epidemics and rumors to the rescue
6. Bend it like Bluemix, MongoDB with autoscaling – Part 1
7. “Is it animal? Is it an insect?” in Android

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

New Version of “Wrangling F1 Data With R” Just Released…

Fri, 2016-02-05 09:45

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

So I finally got round to pushing a revised (and typo corrected!) version of Wrangling F1 Data With R: A Data Junkie’s Guide, which also includes a handful of new sections and chapters, including descriptions of how to detect undercuts, the new-style race history chart that shows the on-track position of each driver for each lap of a race relative to the lap leader, and a range of qualifying session analysis charts that show the evolution of session cut-off times and drivers’ personal best times.

Code is described for every data manipulation and chart that appears in the book, along with directions for how to get hold of (most of) the data required to generate the charts. (Future updates will cover some of the scraping techniques required to get hold of the rest of it!)

As well as the simple book, there’s also a version bundled with the R code libraries that are loaded in as a short-cut in many of the chapters.

The book is published on Leanpub, which means you can access several different electronic versions of the book, and once you’ve bought a copy, you get access to any future updates for no further cost…

There is a charge on the book, with a set minimum price, but you also have the freedom to pay more! Any monies received for this book go to cover costs (I’ve started trying to pay for the webservices I use, rather than just keep using their free plan). If the monthly receipts bump up a little, I’ll try to get some services that generate some of the charts interactively hosted somewhere…

To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog…. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Data from the World Health Organization API

Fri, 2016-02-05 06:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Eric Persson released yesterday a new WHO R package which allows easy access to the World Health Organization’s data API. He’s also done a nice vignette introducing its use.

I had a play and found it gave easy access to some interesting data. Some time down the track I might do a comparison of this with other sources, the most obvious being the World Bank’s World Development Indicators, to identify relative advantages – there’s a lot of duplication of course. It’s a nice problem to have, too much data that’s too easy to get hold of. I wish we’d had that problem when I studied aid and development last century – I vividly remember re-keying numbers from almanac-like hard copy publications, and pleased we were to have them too!

Here’s a plot showing country-level relationships between the latest data of three indicators – access to contraception, adolescent fertility, and infant mortality – that help track the Millennium Development Goals.

Note that there’s some micro-interactivity here – you can move the labels around, zoom in on a part of the plot, or reposition the whole thing. It’s easier to explore in the full screen version. Thanks to Julien Barnier’s scatterD3 R package for making this JavaScript D3 functionality easily available from R.

Here are the exploratory plots I made on the way to that. All the code is reproduced at the bottom of the post.

Infant mortality has been decreasing steadily since 1990, when these data start, and in fact for decades before too. Sad blip in Haiti from the 2010 earthquake:

The data on adolescent fertility is surprisingly sparse, with only one data point per country, and many of them quite old. While Africa is often in a class of its own on development indicators, this graphic might surprise people in showing just how much of an issue teen pregnancy is in Africa (note that the inclusion of Sudan and Somalia in the “Eastern Mediterranean” region wasn’t done by me):

Data on contraception use is also sparse; mostly one data point per country, but a few have two. There’s not much that stands out here other than the high usage in the wealthier countries of Europe (the blue ones) and low in the poor countries of Africa (red).

Here’s the R code that did that. Hopefully other people will pick up on this – good work from the WHO for their database and API, and Eric Persson for making it easy to access from R.

# basic functionality
library(WHO)
library(dplyr)
library(ggplot2)
library(scales)
library(showtext)
library(scatterD3) # note - this is the GitHub version so we have ellipses
library(stringr)
library(RColorBrewer)
library(dygraphs)

# ---------------default fonts, themes and colour scale
font.add.google("Poppins", "myfont")
showtext.auto()
mdg_theme <- theme_light(base_family = "myfont") +
   theme(legend.position = "bottom")
theme_set(mdg_theme)
scale_colour_discrete <- function(...) {
   scale_colour_manual("", values = brewer.pal(6, "Spectral")[c(1,2,5,6)], ...)
}

#------------------data prep-------------
# download all WHO codes
codes <- get_codes()
dim(codes) # 2144 codes
mdg_codes <- codes[grepl("^MDG", codes$label), ]
mdg_codes$number <- as.numeric(str_sub(mdg_codes$label, start = -2))
dim(mdg_codes) # 33 for MDGs
mdg_codes$label

#----------------helper functions---------------
prep <- function(data){
   # remove regional groupings, and make income a factor with levels in correct order
   data %>%
      filter(!worldbankincomegroup %in% c("Global", "NA")) %>%
      filter(!is.na(region)) %>%
      filter(!is.na(country)) %>%
      mutate(worldbankincomegroup = factor(worldbankincomegroup,
         levels = c("Low-income", "Lower-middle-income", "Upper-middle-income", "High-income")))
}

latest <- function(data, nudge = 1){
   # return a cut back data frame of just the latest value, useful for geom_text annotations
   data %>%
      group_by(country) %>%
      filter(year == max(year)) %>%
      mutate(year = year + nudge)
}

#-------------------Infant mortality--------------
mdg1 <- get_data("MDG_0000000001")
# values are a funny mixture of points and intervals, so we extract just the points:
mdg1$value_n <- as.numeric(str_extract(mdg1$value, "[0-9]*\\.[0-9]*"))
mdg1a <- mdg1 %>% prep()

p1 <- mdg1a %>%
   ggplot(aes(x = year, y = value_n, colour = worldbankincomegroup, group = country)) +
   geom_line() +
   facet_wrap(~region) +
   geom_text(data = latest(mdg1a), aes(label = country), hjust = 0, size = 3) +
   xlim(1990, 2025) +
   labs(x = "Year", y = mdg_codes[1, "display"],
        title = "'Reduce child mortality' MDG Indicator 14, infant mortality rate")

svg("..http://ellisp.github.io/img/0029-mdg1.svg", 10, 9)
print(p1)
dev.off()

#--------------Adolescent fertility----------------
mdg3 <- get_data("MDG_0000000003")
# This isn't listed at http://www.unmillenniumproject.org/goals/gti.htm as an MDG indicator,
# but according to http://www.wikigender.org/wiki/adolescent-birth-rate/ it's part of
# measuring Goal 5 "Improve Maternal Health"

svg("..http://ellisp.github.io/img/0029-mdg3.svg", 10, 9)
mdg3 %>%
   prep() %>%
   ggplot(aes(x = year, y = value, label = country, colour = worldbankincomegroup)) +
   labs(x = "Year of latest data", y = mdg_codes[2, "display"],
        title = "'Improve maternal health' MDG un-numbered indicator, adolescent fertility") +
   geom_text() +
   facet_wrap(~region) +
   scale_colour_manual("", values = brewer.pal(6, "Spectral")[c(1,2,5,6)]) +
   theme(legend.position = "bottom")
dev.off()

#----------contraceptive prevalence----------
mdg5 <- get_data("MDG_0000000005")
# for some odd reason there's no income group so we make it up:
mdg5a <- prep(mdg5) %>%
   select(-worldbankincomegroup) %>%
   left_join(unique(mdg1a[ , c("country", "worldbankincomegroup")]))

p5 <- mdg5a %>%
   prep() %>%
   ggplot(aes(x = year, y = value / 100, group = country, colour = worldbankincomegroup)) +
   facet_wrap(~region) +
   scale_y_continuous(mdg_codes[3, "display"], label = percent) +
   labs(x = "Year of latest data",
        title = "'Combat HIV/AIDS, malaria and other diseases' MDG indicator 19c, contraception prevalence") +
   xlim(1990, 2020) +
   geom_point() +
   geom_line() +
   geom_text(data = latest(mdg5a, 0.3), aes(label = country), hjust = 0, size = 3)

svg("..http://ellisp.github.io/img/0029-mdg5.svg", 10, 9)
print(p5)
dev.off()

#==================scatter plot==================
# combine the three datasets into just one and knock out countries with missing values
comb <- latest(mdg3) %>%
   select(country, region, worldbankincomegroup, value) %>%
   rename(AdolFert = value) %>%
   left_join(latest(mdg5)[, c("country", "value")], by = "country") %>%
   rename(Contra = value) %>%
   left_join(latest(mdg1)[ , c("country", "value_n")], by = "country") %>%
   rename(InfantMortality = value_n) %>%
   filter(!is.na(country)) %>%
   filter(!is.na(Contra))

# draw scatter plot
scatterD3(x = comb$Contra, y = comb$AdolFert, lab = comb$country,
          size_var = comb$InfantMortality,
          col_var = comb$worldbankincomegroup,
          symbol_var = comb$region,
          xlab = "Access to contraception (%)",
          ylab = "Adolescent fertility rate (per 1000 girls aged 15-19)",
          col_lab = "", symbol_lab = "",
          size_lab = "Infant Mortality per 1000",
          ellipses = TRUE, ellipses_level = 0.75)
# note - next step to get into a web page requires manually saving it.

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Alternate R Markdown Templates

Thu, 2016-02-04 21:27

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The knitr/R markdown system is a great way to organize reports and analyses. However, the built-in ones (that come with RStudio/the rmarkdown package) rely on Bootstrap and also use jQuery. There’s nothing wrong with that, but the generated standalone HTML documents (which are a great way to distribute reports) don’t really need all that cruft and it’s fun & informative to check out new frameworks from time-to-time. Also, jQuery is a heavy crutch I’m working hard to not need anymore.

To that end, I created a package — markdowntemplates — that contains three alternate templates that you can use out of the box or (hopefully) clone, customize and use in your future work. One template is based on the Bulma CSS framework, the other is based on the Skeleton CSS framework and the last one is a super-minimal template with no formatting (i.e. it’s a good one to build on).

The github repo has screenshots of the basic formatting.

I tried to keep with the base formatting of each theme, but I went a bit crazy and showed how to have a fixed banner in the Skeleton version.

How it works

There are three directories inside inst/rmarkdown/templates each has a similar structure:

  • a resources directory with CSS (and potentially javascript)
  • a skeleton directory which holds example Rmd “skeleton” files
  • a base.html file which is the parameterized HTML template for the Rmd
  • a template.yaml file which is how RStudio/knitr knows there’s one or more R markdown templates in your package

The minimal base.html is small enough to include here:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"$if(lang)$ lang="$lang$" xml:lang="$lang$"$endif$>

<head>

<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1">

<title>$if(title)$$title$$endif$</title>

$for(header-includes)$
$header-includes$
$endfor$

$if(highlightjs)$
<style type="text/css">code{white-space: pre;}</style>
<link rel="stylesheet" href="$highlightjs$/$if(highlightjs-theme)$$highlightjs-theme$$else$default$endif$.css" $if(html5)$$else$type="text/css" $endif$/>
<script src="$highlightjs$/highlight.js"></script>
<script type="text/javascript">
if (window.hljs && document.readyState && document.readyState === "complete") {
   window.setTimeout(function() { hljs.initHighlighting(); }, 0);
}
</script>
$endif$

$if(highlighting-css)$
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
$highlighting-css$
</style>
$endif$

$for(css)$
<link rel="stylesheet" href="$css$" $if(html5)$$else$type="text/css" $endif$/>
$endfor$

</head>

<body>
<div class="container">

<h1>$if(title)$$title$$endif$</h1>

$for(include-before)$
$include-before$
$endfor$

$if(toc)$
<div id="$idprefix$TOC">
$toc$
</div>
$endif$

$body$

$for(include-after)$
$include-after$
$endfor$

</div>

$if(mathjax-url)$
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
  (function () {
    var script = document.createElement("script");
    script.type = "text/javascript";
    script.src  = "$mathjax-url$";
    document.getElementsByTagName("head")[0].appendChild(script);
  })();
</script>
$endif$

</body>
</html>

I kept a bit of the RStudio template code for source code formatting, but grokking the actual template language should be pretty straightforward. You should be able to pick out $title$ in there and you can add as many parameters to the Rmd YAML section as you like (with corresponding counterparts in that template). I added a corresponding, exported R function for each supported template to show how easy it is to customize the parameters while also accepting further customizations in the YAML of each Rmd.
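As a small usage sketch (the template directory name here is an assumption; check the repo’s README for the exact names), you can create a fresh Rmd from one of the package skeletons with rmarkdown’s draft() and then knit it as usual:

# hypothetical example: start a new document from the package's "skeleton" template
# (substitute whichever template name the repo actually uses)
rmarkdown::draft("my_report.Rmd", template = "skeleton",
                 package = "markdowntemplates", edit = FALSE)
rmarkdown::render("my_report.Rmd")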

Imagine building a base template with your personal or organization’s branding and having it set apart from the cookie-cutter RStudio rmarkdown Bootstrap/jQuery template. Or, create course-specific templates like the s20x package. Once you peek behind the curtain to see how things are done, it’s not so complex and the sky is the limit.

I’ll try to get these in CRAN as soon as possible. If you have a preference for another CSS framework (I’m thinking of adding a “Metro” CSS kit and a Google web starter CSS kit), shoot me an issue or PR and I’ll incorporate it. The more examples we have the easier it will be for folks to create new templates.

Any & all feedback is most welcome.

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

OpenCPU Server Release 1.5.4

Thu, 2016-02-04 19:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

Version 1.5.4 of the OpenCPU server has been released to Launchpad (Ubuntu) and OBS (Fedora). This update does not introduce any changes to the OpenCPU API itself; it improves the deb/rpm installation packages and upgrades the bundled opencpu system R package library.

Installing and Updating

Existing Ubuntu and Fedora servers that are already running the 1.5 branch will automatically update the next time they run apt-get update or yum update. Alternatively, to install the OpenCPU server on a fresh Ubuntu 14.04 machine:

sudo add-apt-repository -y ppa:opencpu/opencpu-1.5
sudo apt-get update
sudo apt-get install -y opencpu

Or to install it on Fedora 22 or 23 from OBS:

cd /etc/yum.repos.d/
wget http://download.opensuse.org/repositories/home:jeroenooms:opencpu-1.5/Fedora_23/home:jeroenooms:opencpu-1.5.repo
yum install opencpu

To install the OpenCPU server on other distributions, simply follow the instructions to build the deb (Debian/Ubuntu) or rpm (Fedora/CentOS/RHEL) packages from source, which is very easy.

The OpenCPU Package Library

Because OpenCPU is implemented completely in R, the server stack ships with a private library of R packages needed by the system in /usr/lib/opencpu/library. The isolated library allows you to freely install/upgrade/uninstall your own R packages on your server without accidentally breaking the OpenCPU server. This is critical to guarantee the system is stable at all times and unaffected by whatever crazy things are happening in R.

However, a side effect of this design is that for these system packages, the user might see a different package version when calling R via the OpenCPU API than when running R from the terminal on the same server. This is unfortunate because OpenCPU is meant to provide a transparent HTTP API to the system’s R installation. One solution would be to add the opencpu library to your .libPaths(), but this is unnecessarily annoying and complicated.
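For reference, that manual workaround would look something like the following sketch (the path is the default install location described above; the symlinking introduced below makes this unnecessary):

# make the OpenCPU system library visible in an interactive R session
# (adjust the path if your install location differs)
.libPaths(c("/usr/lib/opencpu/library", .libPaths()))
.libPaths()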

To make this easier, the OpenCPU rpm/deb packages now automatically create symlinks to the OpenCPU system library in the global R package library. Thereby the OpenCPU system library is still safely isolated, but the packages are also visible when running R in the terminal, hence we don’t need to install them again. Hopefully this makes managing packages on your OpenCPU server a little easier.

To leave a comment for the author, please follow the link and comment on their blog: OpenCPU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Free video course: applied Bayesian A/B testing in R

Thu, 2016-02-04 15:29

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

As a “thank you” to our blog, mailing list, and Twitter followers (@WinVectorLLC) we at Win-Vector LLC have decided to re-release our formerly fee-based A/B testing video course as a free (advertisement supported) video course here on Youtube.


The course emphasizes how to design A/B tests using prior “guestimates” of effect sizes (often you have these from prior campaigns, or somebody claims an effect size and it is merely your job to confirm it). It is fairly technical, and the emphasis is Bayesian: we are trying to get an actual estimate of the distribution of the unknown true expected payoff rate of the various campaigns (the so-called posteriors). We show how to design and evaluate sales campaigns for a product at two different price points.
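To give a flavour of the kind of calculation involved, here is a tiny illustrative sketch (not the course’s code, and the counts are made up) comparing two campaigns’ posteriors under uniform Beta priors:

# toy Beta-Binomial comparison of two campaigns, assuming Beta(1,1) priors and invented counts
set.seed(1)
draws_a <- rbeta(1e5, 1 + 120, 1 + 1000 - 120)   # campaign A: 120 successes out of 1000
draws_b <- rbeta(1e5, 1 + 150, 1 + 1000 - 150)   # campaign B: 150 successes out of 1000
mean(draws_b > draws_a)                          # posterior probability B's true rate beats A's
quantile(draws_b - draws_a, c(0.025, 0.975))     # credible interval for the difference in rates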

The solution is coded in R and Nina Zumel has contributed an updated Shiny user interface demonstrating the technique (for more on Shiny, please see here). The code for the calculation methods and older shiny app are shared here.

This sort of fills out our survey of ways to think about A/B testing:

We have a lot more material on statistics and data science (though not on A/B testing) in our book and our paid video course Introduction to Data Science.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Weekly R-Tips: Visualizing Predictions

Thu, 2016-02-04 14:59

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

Let’s say that we estimated a linear regression model on time series data with lagged predictors. The goal is to estimate sales as a function of inventory, search volume, and media spend from two months ago. After using the lm function to perform linear regression, we predict sales using values from two months ago.

frmla <- sales ~ inventory + search_volume + media_spend
mod <- lm(frmla, data=dat)
pred = predict(mod, values, interval="predict")

If this model is estimated weekly or monthly, we will eventually want to understand how well our model did in predicting actual sales from month to month. To perform this task, we must regularly maintain a spreadsheet or data structure (RDS object) with actual and predicted sales figures for each time period. That data can be used to create line graphs that visualize the actual versus predicted values.
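The exact bookkeeping will vary, but a hypothetical sketch of the reshaping step assumed by the plot below (the actuals and preds vectors stand in for whatever you have been accumulating each period) is:

# hypothetical reshaping of accumulated actual and predicted sales into long format
library(reshape2)
mydat <- data.frame(Month     = seq(as.Date("2010-01-01"), by = "month", length.out = length(actuals)),
                    actual    = actuals,
                    predicted = preds)
mydat <- melt(mydat, id.vars = "Month")   # columns: Month, variable, value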

ggplot(mydat, aes(Month, value, group=variable, colour=variable)) +
  geom_line(lwd=1.05) +
  geom_point(size=2.5) +
  ggtitle("Sales (01/2010 to 05/2015)") +
  xlab("Date") + ylab("Sales") +
  ylim(0,30000) +
  xlab(" ") + ylab(" ") +
  theme(legend.title=element_blank()) +
  xlab(" ") +
  theme(axis.text.x=element_text(colour="black")) +
  theme(axis.text.y=element_text(colour="black")) +
  theme(legend.position=c(.4, .85))

Above is an example of what the final product could look like. Visualizing predicted against actual values is an important component of evaluating the quality of a model. Furthermore, having such a visualization will be of value when interacting with business audiences and “selling” your analysis.

To leave a comment for the author, please follow the link and comment on their blog: R – Mathew Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Predicting wine quality using Random Forests

Thu, 2016-02-04 13:49

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Hello everyone! In this article I will show you how to run the random forest algorithm in R. We will use the wine quality data set (white) from the UCI Machine Learning Repository.

What is the Random Forest Algorithm?

In a previous post, I outlined how to build decision trees in R. While decision trees are easy to interpret, they tend to be rather simplistic and are often outperformed by other algorithms. Random Forests are one way to improve the performance of decision trees. The algorithm starts by building out trees similar to the way a normal decision tree algorithm works. However, every time a split has to be made, it uses only a small random subset of features instead of the full set (usually \(\sqrt{p}\), where p is the number of predictors). It builds multiple trees using the same process, and then takes the average of all the trees to arrive at the final model. This works by reducing the amount of correlation between trees, and thus helping to reduce the variance of the final model. The simplest way to understand this (as explained in Introduction to Statistical Learning) is: if you have some numbers \(Z_1, Z_2, \ldots, Z_n\), each with variance \(\sigma^2\), then their mean \(\bar{Z}\) will have variance \(\sigma^2/n\).

Exploratory Data Analysis

Let us read in the data and explore it. We can read in the data directly from the page using the read.table function.

url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
wine <- read.table(url, header = TRUE, sep = ';')  # the file is semicolon-delimited with a header row
head(wine)
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol quality
1           7.0             0.27        0.36           20.7     0.045                  45                  170  1.0010 3.00      0.45     8.8       6
2           6.3             0.30        0.34            1.6     0.049                  14                  132  0.9940 3.30      0.49     9.5       6
3           8.1             0.28        0.40            6.9     0.050                  30                   97  0.9951 3.26      0.44    10.1       6
4           7.2             0.23        0.32            8.5     0.058                  47                  186  0.9956 3.19      0.40     9.9       6
5           7.2             0.23        0.32            8.5     0.058                  47                  186  0.9956 3.19      0.40     9.9       6
6           8.1             0.28        0.40            6.9     0.050                  30                   97  0.9951 3.26      0.44    10.1       6

Let us look at the distribution of the wine quality. We can use barplot for this.

barplot(table(wine$quality))

The barplot:

As we can see, there are a lot of wines with a quality of 6 as compared to the others. The dataset description states – there are a lot more normal wines than excellent or poor ones. For the purpose of this discussion, let’s classify the wines into good, bad, and normal based on their quality.

wine$taste <- ifelse(wine$quality < 6, 'bad', 'good')
wine$taste[wine$quality == 6] <- 'normal'
wine$taste <- as.factor(wine$taste)

This will classify all wines into bad, normal, or good, depending on whether their quality is less than, equal to, or greater than 6 respectively. Let’s look at the distribution again.

table(wine$taste)

   bad   good normal
  1640   1060   2198

Before we build our model, let’s separate our data into testing and training sets.

set.seed(123)
samp <- sample(nrow(wine), 0.6 * nrow(wine))
train <- wine[samp, ]
test <- wine[-samp, ]

This will place 60% of the observations in the original dataset into train and the remaining 40% of the observations into test.

Building the model

Now, we are ready to build our model. We will need the randomForest library for this.

library(randomForest)
model <- randomForest(taste ~ . - quality, data = train)

We can use ntree and mtry to specify the total number of trees to build (default = 500), and the number of predictors to randomly sample at each split respectively. Let’s take a look at the model.

model

Call:
 randomForest(formula = taste ~ . - quality, data = train)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 29.54%
Confusion matrix:
       bad good normal class.error
bad    683   16    274   0.2980473
good    16  404    229   0.3775039
normal 222  111    983   0.2530395

We can see that 500 trees were built, and the model randomly sampled 3 predictors at each split. It also shows a matrix containing prediction vs actual, as well as classification error for each class. Let’s test the model on the test data set.

pred <- predict(model, newdata = test)
table(pred, test$taste)

pred     bad good normal
  bad    482   10    130
  good    14  252     85
  normal 171  149    667

We can test the accuracy as follows:

(482 + 252 + 667) / nrow(test)
0.7147959
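An equivalent one-liner (a small addition, not in the original post) avoids hard-coding the diagonal counts:

# accuracy = proportion of test rows where the predicted class matches the actual class
mean(pred == test$taste)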

There we have it! We achieved ~71.5% accuracy with a very simple model. It could be further improved by feature selection, and possibly by trying different values of mtry.
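As an illustrative next step (not from the original post), you could refit with a different mtry and look at variable importance to guide that feature selection:

# sketch: refit with a larger mtry and inspect which predictors matter most
model2 <- randomForest(taste ~ . - quality, data = train, mtry = 5, importance = TRUE)
importance(model2)    # mean decrease in accuracy / Gini per predictor
varImpPlot(model2)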

That brings us to the end of the article. I hope you enjoyed it! As always, if you have questions or feedback, feel free to reach out to me on Twitter or leave a comment below!

To leave a comment for the author, please follow the link and comment on their blog: DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Using Microsoft R Open with RStudio

Thu, 2016-02-04 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

A frequent question that we get here at Microsoft about MRO (Microsoft R Open) is: can it be used with RStudio? The short answer is absolutely yes! In fact, more than just being compatible, MRO is the perfect complement for the RStudio environment. MRO is a downstream distribution of open source R that supports multiple operating systems and provides features that enhance the performance and reproducible use of the R language. RStudio, being much more than a simple IDE, provides several features, such as the tight integration of knitr, RMarkdown and Shiny, that promote literate programming, the creation of reproducible code, and sharing and collaboration. Together, MRO and RStudio make a powerful combination. Before elaborating on this theme, I should just make it clear how to select MRO from the RStudio IDE. After you have installed MRO on your system, open RStudio, go to the "Tools" tab at the top, and select "Global Options". You should see a couple of pop-up windows like the screen capture below. If RStudio is not already pointing to MRO (like it is in the screen capture), browse to it, and click "OK".

 

One feature of MRO that dovetails nicely with RStudio is that way that MRO is tied to a fixed repository. Every day, at precisely midnight UTC, the infrastructure that supports the MRO distribution takes a snapshot of CRAN and stores it on Microsoft’s MRAN site. (You can browse through the snapshots back to September 17, 2014 from the CRAN Time Machine.) Each MRO release is pre-configured to point to a particular CRAN snapshot. MRO 3.2.3, for example, points to CRAN as it was on January 1, 2016. Everyone who downloads MRO is guaranteed to start from a common baseline that reflects CRAN and all of its packages as they existed at a particular point in time. This provides an enormous advantage for corporations and collaborating teams of R programmers who can be sure that they are at least starting off on the same page, all working with the same CRAN release and a consistent view of the universe of R packages.
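If you want a plain R session (not MRO) to see the same package universe, a minimal sketch is to point your repos option at the corresponding MRAN snapshot (the date here matches the MRO 3.2.3 default mentioned above):

# point this session at the fixed CRAN snapshot used by MRO 3.2.3
options(repos = c(CRAN = "https://mran.revolutionanalytics.com/snapshot/2016-01-01"))
getOption("repos")
install.packages("dplyr")   # installs the package version as of that snapshot date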

However, introducing the discipline of a fixed repository into the RStudio workflow is not completely frictionless. Occasionally, the stars don’t line up perfectly and an RStudio user, or any other user that needs a particular version of a CRAN package for some reason, may have to take some action. For example, I recently downloaded MRO 3.2.3, fired up RStudio and thought “sure why not” when reminded that a newer version of RStudio was available. Then, I clicked to create a new rmarkdown file and was immediately startled by an error message that said that the available rmarkdown package was not the version required by RStudio. The easy fix, of course, was to point to a repository containing a more recent version of rmarkdown than the one associated with the default snapshot date. If this happens to you, either of the following will take care of things:        

To get the latest version of the rmarkdown package, use:
install.packages("rmarkdown", repos = "https://cran.revolutionanalytics.com")

To get the 0.9.2 version of the rmarkdown package, use:
install.packages("rmarkdown", repos = "https://mran.revolutionanalytics.com/snapshot/2016-01-02")

Apparently, by chance, we missed by one day setting a snapshot date for MRO that would have been convenient for RStudio users.

A second way that MRO fits into RStudio is the way that the checkpoint package, which installs with MRO, can enhance the reproducibility power of RStudio’s project directory structure. If you choose a new directory when setting up a new RStudio project, and then run the checkpoint() function from that project, checkpoint will set up a local repository in a subdirectory of the project directory. For example, executing the following two lines of code from a script in the MyProject directory will install all packages required by your project as they were at midnight UTC on the specified date.

library(checkpoint)
checkpoint("2016-01-29")

Versions of all of the packages that are called out by scripts in your MyProject directory that existed on CRAN on January 29, 2016 will be installed in a subfolder of MyProject underneath ~/.checkpoint. Unless you use the same checkpoint date for other projects, the packages for MyProject will be independent of packages installed for those other projects. This kind of project specific structure is very helpful for keeping things straight. It provides a reproducibility code sharing layer on top of (or maybe underneath) RStudio's GitHub integration and other reproducibility features. When you want to share code with a colleague they don't need to manually install all of the packages ahead of time. Just have them clone your GitHub repository or put your code into their own RStudio project in some other way and then have them run checkpoint() from there. Checkpoint will search through the scripts in their project and install the versions of the packages they need.

Finally, I should mention that MRO can enhance any project by providing multi-threaded processing to the code underlying many of the R functions you will be using. R functions that make use of linear algebra operations under the hood, such as matrix multiplication or Cholesky decompositions, will get a considerable performance boost. (Look here for some benchmarks.) For Linux and Windows platforms, users can enable multi-threaded processing by downloading and installing the Intel Math Kernel Libraries (MKL) when they install MRO from the MRAN site. Mac OS X users automatically get multithreading because MRO comes pre-configured to use the Mac Accelerate Framework.
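A quick, informal way to see whether your install is getting that boost is to time a BLAS-bound operation such as a large matrix multiply (an illustrative snippet, not from the post; absolute timings will of course vary by machine):

# rough benchmark of BLAS/LAPACK-heavy operations that multi-threaded MKL accelerates
set.seed(42)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(m %*% t(m))            # dense matrix multiplication
system.time(chol(crossprod(m)))    # Cholesky decomposition of a positive definite matrix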

Let us know if you use RStudio with MRO.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs