R bloggers

R news and tutorials contributed by (563) R bloggers
Updated: 2 hours 12 min ago

sjPlot 1.6 – major revisions, anyone for beta testing? #rstats

Thu, 2014-10-23 15:33

(This article was first published on Strenge Jacke! » R, and kindly contributed to R-bloggers)

In the last couple of weeks I have rewritten some core parts of my sjPlot package and also revised the package and online documentation.

Most notable are the changes that affect theming and appearance of plots and figures. There’s a new function called sjp.setTheme which now sets theme options for all sjp-functions, which means

  1. you only need to specify theme / appearance options once and no longer need to repeat these parameters for each sjp-function call you make
  2. due to this change, all sjp-functions have far fewer parameters, making the functions and documentation clearer
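Conceptually, the new workflow looks something like this (a sketch only; the argument names and theme values shown here are illustrative and may differ from the released function, so check the package documentation):

```r
library(sjPlot)

# set theme / appearance options once...
sjp.setTheme(theme = "forest")

# ...and every subsequent sjp-function call picks them up automatically
sjp.frq(mydata$var1)
sjp.frq(mydata$var2)
```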

Furthermore, due to some problems with connecting / updating to the RPubs server, I decided to upload my online documentation for the package to my own site. You will now find the latest, comprehensive documentation and examples for various functions of the sjPlot package at www.strengejacke.de/sjPlot/. For instance, take a look at customizing plot appearance and see how the new theming feature of the package allows both easier customization of plots as well as better integration of theming packages like ggthemr or ggthemes.

An update of the sjPlot package on CRAN is planned soon; however, I kindly ask you to test the current development snapshot, which is hosted on GitHub. You can easily install the package from there using the devtools package (devtools::install_github("devel", "sjPlot")). The current snapshot is (very) stable and I appreciate any feedback or bug reports (if possible, use the issue tracker on GitHub).

The current change log with all new functions, changes and bug fixes can also be found on GitHub.

Tagged: data visualization, Open Source, R, rstats, sjPlot

To leave a comment for the author, please follow the link and comment on his blog: Strenge Jacke! » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Happening just now… 6th Conference of the R Spanish User Community

Thu, 2014-10-23 13:31

(This article was first published on Tales of R » R, and kindly contributed to R-bloggers)

The R-Spain Conferences have been taking place since 2009 as an expression of the growing interest that R elicits in many fields. The organiser is the Comunidad R Hispano (R-es). The community supports many groups and initiatives aimed at developing R knowledge and widening its use.

To attend the talks by streaming (they are in Spanish) you must register.

There is also a scientific programme with the presentations (some in English) here.

To leave a comment for the author, please follow the link and comment on his blog: Tales of R » R.

Introducing Rocker: Docker for R

Thu, 2014-10-23 12:39

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

You only know two things about Docker. First, it uses Linux containers. Second, the Internet won't shut up about it. -- attributed to Solomon Hykes, Docker CEO
So what is Docker?

Docker is a relatively new open source application and service, which is seeing interest across a number of areas. It uses recent Linux kernel features (containers, namespaces) to shield processes. While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly, require very few resources and hence permits an order of magnitude more deployments per host than a virtual machine.

Docker offers a standard interface to creation, distribution and deployment. The shipping container analogy is apt: just as shipping containers (via their standard size and "interface") allow global trade to prosper, Docker is aiming for nothing less for deployment. A Dockerfile provides a concise, extensible, and executable description of the computational environment. Docker software then builds a Docker image from the Dockerfile. Docker images are analogous to virtual machine images, but smaller and built in discrete, extensible and reusable layers. Images can be distributed and run on any machine that has Docker software installed---including Windows, OS X and of course Linux. Running instances are called Docker containers. A single machine can run hundreds of such containers, including multiple containers running the same image.

There are many good tutorials and introductory materials on Docker on the web. The official online tutorial is a good place to start; this post cannot go into more detail in order to remain short and introductory.

So what is Rocker?

At its core, Rocker is a project for running R using Docker containers. We provide a collection of Dockerfiles and pre-built Docker images that can be used and extended for many purposes. Rocker is the name of our GitHub repository within the Rocker-org GitHub organization. Rocker is also the name of the account under which the automated builds at Docker provide containers ready for download.

Current Rocker Status

Core Rocker Containers

The Rocker project develops the following containers in the core Rocker repository:
  • r-base provides a base R container to build from
  • r-devel provides the basic R container, as well as a complete R-devel build based on current SVN sources of R
  • rstudio provides the base R container as well as an RStudio Server instance
We have settled on these three core images after earlier work in repositories such as docker-debian-r and docker-ubuntu-r.

Rocker Use Case Containers

Within the Rocker-org organization on GitHub, we are also working on:
  • Hadleyverse which extends the rstudio container with a number of Hadley packages
  • rOpenSci which extends hadleyverse with a number of rOpenSci packages
  • r-devel-san provides an R-devel build for "Sanitizer" run-time diagnostics, using a properly instrumented version of R-devel built with a recent compiler
  • rocker-versioned aims to provide containers with 'versioned' previous R releases and matching packages
Other repositories will probably be added as new needs and opportunities are identified.

Deprecation

The Rocker effort supersedes and replaces earlier work by Dirk (in the docker-debian-r and docker-ubuntu-r GitHub repositories) and Carl. Please use the Rocker GitHub repo and Rocker containers from Docker.com going forward.

Next Steps

We intend to follow up with more posts detailing usage of both the source Dockerfiles and binary containers on different platforms. Rocker containers are fully functional. We invite you to take them for a spin. Bug reports, comments, and suggestions are welcome; we suggest you use the GitHub issue tracker.

Acknowledgments

We are very appreciative of all comments received by early adopters and testers. We would also like to thank RStudio for allowing us to redistribute their RStudio Server binary.

Published concurrently at the rOpenSci blog and Dirk's blog.

Authors

Dirk Eddelbuettel and Carl Boettiger
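As a concrete example, a minimal Dockerfile extending the base image could look like the following (the packages chosen are purely illustrative; install2.r is a littler helper script shipped in the Rocker images):

```dockerfile
# build on the Rocker base R image from Docker Hub
FROM rocker/r-base:latest

# install a couple of CRAN packages at image build time
RUN install2.r --error ggplot2 dplyr
```

Building this with `docker build -t my-r .` and then running `docker run --rm -ti my-r` drops you into an R session with those packages available.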

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

A first look at Distributed R

Thu, 2014-10-23 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

One of the most interesting R related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala  and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running  R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.


A master node controls multiple worker nodes each of which runs multiple R processes in parallel.

As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allows users to write their own distributed functions using ordinary R functions.

To make all of this happen, the Distributed R platform contains several components that may be briefly described as follows:

The distributedR package contains:

  • functions to set up the infrastructure for the distributed platform
  • distributed data structures that are the analogues of R’s data frames, arrays and lists, and
  • the functions foreach() and splits() to let users write their own parallel algorithms.

A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks, making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
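The block layout is analogous to partitioning an ordinary matrix in base R; a distributed array simply keeps such partitions on separate workers. A plain base R sketch of row-wise partitioning (this is not the Distributed R API, just an illustration of the idea):

```r
# partition a 10 x 2 matrix into row blocks of 5 rows each
m <- matrix(1:20, nrow = 10, ncol = 2)
block_size <- 5
row_blocks <- lapply(seq(1, nrow(m), by = block_size),
                     function(i) m[i:min(i + block_size - 1, nrow(m)), , drop = FALSE])
length(row_blocks)     # 2 blocks
nrow(row_blocks[[1]])  # 5 rows per block
```

Each block can then be processed independently, which is exactly what the darray partitions enable across machines.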

vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various data sources, including HP’s Vertica database.

The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.

The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression models are available). The package also contains functions for cross validation and split-sample validation.

The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.

The HPdcluster package contains a parallel, distributed kmeans algorithm.

The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the PageRank algorithm for directed graphs.

The following sample code, taken directly from the HPdclassifier User Guide, but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up, they are loaded in parallel with data using the foreach function from the distributedR package.

library(HPdclassifier) # loading the library

Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest

distributedR_start() # starting the distributed environment

Workers registered - 1/1.
All 1 workers are registered.
[1] TRUE

ds <- distributedR_status() # cluster status, including instances per worker
nparts <- sum(ds$Inst) # number of available distributed instances

# Describe the data
nSamples <- 100 # number of samples
nAttributes <- 5 # number of attributes of each sample
nSplits <- 1 # number of splits in each darray

# Create the distributed arrays
dax <- darray(c(nSamples,nAttributes), c(round(nSamples/nSplits),nAttributes))

day <- darray(c(nSamples,1), c(round(nSamples/nSplits),1))

# Load the distributed arrays
foreach(i, 1:npartitions(dax),
        function(x = splits(dax, i), y = splits(day, i)){
          x <- matrix(runif(nrow(x)*ncol(x)), nrow(x), ncol(x))
          y <- matrix(runif(nrow(y)), nrow(y), 1)
          update(x) # write the generated values back to the darray partition
          update(y)
        })

# Fit the Random Forest Model
myrf <- hpdrandomForest(dax, day, nExecutor=nparts)

# prediction
dp <- predictHPdRF(myrf, dax)

Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.

Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.

Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolution-r-enterprise-distributedr.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

Making an R Package to use the HERE geocode API

Thu, 2014-10-23 10:29

(This article was first published on Stats and things, and kindly contributed to R-bloggers)

HERE is a product by Nokia, formerly called Nokia Maps and before that, Ovi Maps. It's the result of the acquisition of NAVTEQ in 2007 combined with Plazes and Metacarta, among others. It has a geocoding API, mapping tiles, routing services, and other things. I'm focused on the geocoding service. Under the “Base” license, you can run 10,000 geocoding requests per day. According to Wikipedia, that is the most among the free geocoding services. On top of that, HERE does bulk geocoding, where you submit a file of things to geocode and it returns a link to download a file of results. This sure beats making thousands of requests one at a time.

I figured coding up a quick function to use this service would be the perfect reason to build my first R package. So, after getting my HERE API keys, I fired up devtools and got started with the R package.

First, I had to install the latest versions of devtools and roxygen2:

install.packages("devtools", repos="http://cran.rstudio.com/")
install.packages("roxygen2", repos="http://cran.rstudio.com/")

Next, I chose a directory, and ran the create() function from devtools. This creates a package skeleton that contains the basic structure needed for an R package.
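A sketch of that call (the package name is taken from the DESCRIPTION file shown later in the post):

```r
library(devtools)

# scaffold a new package skeleton in the current directory
create("geocodeHERE")
```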


Several files and directories are created after running create(). I moved over to the 'R' directory and created a file called “geocodeHERE_simple.R”. I put my function to use the HERE geocoding API in there. This function was designed to be minimalist and similar to the ggmap geocode() function.

#' Attempt to geocode a string
#' Enter a string and have latitude and longitude returned using the HERE API
#' @param search A string to search
#' @param App_id App_id to use the production HERE API. Get one here... http://developer.here.com/get-started. If left blank, will default to demo key with an unknown usage limit.
#' @param App_code App_code to use the production HERE API. Get one here... http://developer.here.com/get-started. If left blank, will default to demo key with an unknown usage limit.
#' @keywords geocode
#' @export
#' @examples
#' \dontrun{
#' geocodeHERE_simple("chicago")
#' geocodeHERE_simple("wrigley field chicago IL")
#' geocodeHERE_simple("233 S Wacker Dr, Chicago, IL 60606")
#' }
geocodeHERE_simple <- function(search, App_id="", App_code=""){
  if(!is.character(search)){stop("'search' must be a character string")}
  if(!is.character(App_id)){stop("'App_id' must be a character string")}
  if(!is.character(App_code)){stop("'App_code' must be a character string")}

  # no credentials given: fall back to the demo key and the demo ("cit") endpoint
  if(App_id=="" & App_code==""){
    App_id <- "DemoAppId01082013GAL"
    App_code <- "AJKnXv84fjrb0KIHawS0Tg"
    base_url <- "http://geocoder.cit.api.here.com/6.2/geocode."
  }else{
    base_url <- "http://geocoder.api.here.com/6.2/geocode."
  }

  format <- "json" # response format
  search <- RCurl::curlEscape(search)

  final_url <- paste0(base_url, format, "?app_id=", App_id, "&app_code=",
                      App_code, "&searchtext=", search)

  response <- RCurl::getURL(final_url)
  response_parsed <- RJSONIO::fromJSON(response)
  if(length(response_parsed$Response$View) > 0){
    ret <- response_parsed$Response$View[[1]]$Result[[1]]$Location$DisplayPosition
  }else{
    ret <- NA
  }
  ret
}

Note the text at the top of the code. That is added in order to automatically create help documentation for the function.

Now, if you look closely, I am using two other packages in that function… RCurl and RJSONIO. Also, notice that I'm not making any library() calls. This must be done in the DESCRIPTION file that is automatically generated by create(). In addition to calling out what packages to import, you input the package name, author, description, etc.

Package: geocodeHERE
Title: Wrapper for the HERE geocoding API
Version: 0.1
Authors@R: "Cory Nissen <corynissen@gmail.com> [aut, cre]"
Description: Wrapper for the HERE geocoding API
Depends:
    R (>= 3.1.1)
Imports:
    RCurl,
    RJSONIO
License: MIT
LazyData: true

Then, I set my working directory to the package root and ran the document() function from devtools, which automatically creates the package documentation.
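In code, that step is simply (run from the directory containing the package):

```r
setwd("geocodeHERE")   # package root
devtools::document()   # turns the roxygen2 comments into man/*.Rd files and NAMESPACE
```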


From this point, you can upload your package to github and use install_github(), or use the install() function to install your package locally.


As far as the HERE geocoding API goes, I find it pretty decent at doing “fuzzy matching”. That is, given a place name instead of an address, it does a good job of correctly returning the coordinates. The downside is that you must register for an API key, whereas Google does not require registration to use its geocoding service. But you only get 2,500 requests per day via Google, while HERE offers 10,000.

Anyhow, a big shout out to Hilary Parker for her blog post on creating an R package using devtools, Hadley Wickham for the devtools package (among others), and RStudio for the fantastic and free (and open source) IDE for R.

The geocodeHERE package is available on GitHub. I'll be adding bulk geocoding functionality as time permits. You can install the package with the following code:
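A sketch of the installation call (the GitHub user name here is an assumption based on the author's e-mail address in the DESCRIPTION file; adjust if the repository lives elsewhere):

```r
library(devtools)

# user name assumed from the author's e-mail address
install_github("corynissen/geocodeHERE")
```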


To leave a comment for the author, please follow the link and comment on his blog: Stats and things.

Extending methylKit : Extract promoters with differentially methylated CpGs

Thu, 2014-10-23 09:54

(This article was first published on CHITKA, and kindly contributed to R-bloggers)

In my previous post, I wrote about the features of methylKit. Here, I will discuss how to extend bisulfite sequencing data analysis beyond methylKit.

Annotation is an important feature of genomic analyses. Coming to bisulfite sequencing analyses such as RRBS or WGBS, methylKit does a pretty good job of identifying the differentially methylated individual CpGs or any specific regions/tiles. It also performs basic annotation and plots pie charts indicating where the differentially methylated CpGs overlap the genic annotations.

The adjacent picture is an example of the kind of annotation performed by methylKit. Using the native functions, methylKit users can annotate the differentially methylated CpGs to the genic regions. The adjacent picture indicates that, among all the diff-meth CpGs, 46% overlap with promoters and 27% overlap with intergenic regions. Another way to look at the annotations is to identify the list of promoters, exons, introns and intergenic regions that overlap differentially methylated CpGs. There are no methylKit functions to perform the annotations from the perspective of genic regions. However, methylKit facilitates this by enabling coercion of the methylKit objects into "GRanges" (GenomicRanges) objects. The following script will help methylKit users in extracting the list of promoters/exons/introns that overlap with differentially methylated CpGs.
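A minimal sketch of that approach (object names are illustrative assumptions: `myDiff` stands for a methylKit differential-methylation object and `promoters.gr` for a GRanges of promoter coordinates):

```r
library(methylKit)
library(GenomicRanges)

# coerce the differentially methylated CpGs to GRanges
diffCpG.gr <- as(myDiff, "GRanges")

# keep only the promoters that overlap at least one differentially methylated CpG
hits <- findOverlaps(promoters.gr, diffCpG.gr)
promoters.with.dmc <- promoters.gr[unique(queryHits(hits))]
```

The same pattern works for exons and introns by swapping in the corresponding GRanges object.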

To leave a comment for the author, please follow the link and comment on his blog: CHITKA.

Leveraging R for Econ Job Market

Thu, 2014-10-23 08:07

(This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

I wanted to describe a little helper I am using to refine the places I want to apply to, since I am going to be on the economics job market this year.

The two main websites where job openings are advertised are:

  • JOE (Job Openings for Economists, run by the American Economic Association)
  • Econ Job Market (EJM)

Now JOE has a really nice feature where you can simply download all job openings into a spreadsheet. This allows you to browse through and refine your search. Econ Job Market does not have such a feature.

The Listing page is quite annoying…


If you want more details for a job opening, such as a description of the fields and application requirements, you will have to click on “more info…”. In the JOE spreadsheet, you have all that information at once.

I wanted to create a JOE-like spreadsheet using the openings from EJM. Some of the openings, of course, overlap, but for some reason some jobs are only on EJM and not on JOE.

So how can we use R to help us do that?

The first thing I wanted to do was to get the simple listings on the main application page from EJM as a spreadsheet. You can simply download the main listings file and extract the pieces of information. Most important is the “posid” field, which is the position ID contained in the EJM URL. This will give you a direct link to the HTML page of the job opening, and it also tells you whether you can apply through EJM.

This leaves you with a listing of EJM links to jobs, their position ID and the EJM Application Link in case the Job Opening accepts applications through EJM. Now you can proceed to simply download all the HTML files using a batch downloader such as DownThemAll. If you want to do that, you can print out a list:

cat(EJM$url, sep="\n")

and enter them into DownThemAll. Alternatively, you can iterate through the list of links and send HTTP queries to get the details for each job separately. This is part of the next lines of code:
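An illustrative base R version of that idea (a sketch, not the author's code; the url column comes from the listing extracted above):

```r
# fetch each posting's HTML page one at a time
pages <- lapply(EJM$url, function(u) {
  Sys.sleep(1)  # be polite to the server between requests
  paste(readLines(u, warn = FALSE), collapse = "\n")
})
```

The details for each job can then be extracted from `pages` with regular expressions or an HTML parser.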

This leaves us with a nice CSV that we can browse through quickly in Excel. For convenience, you can download the listings as of 23-10-2014 below. Good luck with your applications!


To leave a comment for the author, please follow the link and comment on his blog: fReigeist » R.

Why is my OS X Yosemite install taking so long?: an analysis

Thu, 2014-10-23 01:26

(This article was first published on On the lambda » ROn the lambda, and kindly contributed to R-bloggers)

Since the latest Mac OS X update, 10.10 "Yosemite", was released last Thursday, there have been complaints springing up online of the progress bar woefully underestimating the actual time to complete installation. More specifically, it appeared as if, for a certain group of people (myself included), the installer would stall out at "two minutes remaining" or "less than a minute remaining"–sometimes for hours.

In the vast majority of these cases, though, the installation process didn't hang; it was just performing a bunch of unexpected tasks that it couldn't predict.

During the install, striking "Command" + "L" would bring up the install logs. In my case, the logs indicated that the installer was busy right until the very last minute.

Not knowing very much about OS X's installation process and wanting to learn more, and wanting to answer why the installation was taking longer than the progress bar expected, I saved the log to a file on my disk with the intention of analyzing it before the installer automatically restarted my computer.

The log file from the Yosemite installer wasn't in a format that R (or any program) could handle natively so before we can use it, we have to clean/munge it. To do this, we'll write a program in the queen of all text-processing languages: perl.

This script will read the log file, line-by-line from standard input (for easy shell piping), and spit out nicely formatted tab-delimited lines.

#!/usr/bin/perl
use strict;
use warnings;

# read from stdin
while(<>){
    chomp;
    my $line = $_;
    my ($not_message, $message) = split ': ', $line, 2;
    # skip lines with blank messages
    next if $message =~ m/^\s*$/;
    my ($month, $day, $time, $machine, $service) = split " ", $not_message;
    print join("\t", $month, $day, $time, $machine, $service, $message) . "\n";
}

We can output the cleaned log file with this shell command:

echo -e "Month\tDay\tTime\tMachine\tService\tMessage" > cleaned.log
grep '^Oct' ./YosemiteInstall.log | grep -v ']: ' | grep -v ': }' | ./clean-log.pl >> cleaned.log

This cleaned log contains 6 fields: 'Month', 'Day', 'Time', 'Machine (host)', 'Service', and 'Message'. The installation didn't span days (it didn't even span an hour) so technically I didn't need the 'Month' and 'Day' fields, but I left them in for completeness' sake.


Let's set some options and load the libraries we are going to use:

# options
options(echo=TRUE)
options(stringsAsFactors=FALSE)

# libraries
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)

Now we read the log file that I cleaned and add a few columns with correctly parsed timestamps using lubridate’s parse_date_time() function:

yos.log <- read.delim("./cleaned.log", sep="\t") %>%
  mutate(nice.date=paste(Month, Day, "2014", Time)) %>%
  mutate(lub.time=parse_date_time(nice.date, "%b %d! %Y! %H!:%M!:%S!", tz="EST"))

And remove the rows of dates that didn't parse correctly

yos.log <- yos.log[!is.na(yos.log$lub.time),]
head(yos.log)

## Month Day Time Machine Service
## 1 Oct 18 11:28:23 localhost opendirectoryd
## 2 Oct 18 11:28:23 localhost opendirectoryd
## 3 Oct 18 11:28:23 localhost opendirectoryd
## 4 Oct 18 11:28:23 localhost opendirectoryd
## 5 Oct 18 11:28:23 localhost opendirectoryd
## 6 Oct 18 11:28:23 localhost opendirectoryd
## Message
## 1 opendirectoryd (build 382.0) launched - installer mode
## 2 Logging level limit changed to 'notice'
## 3 Initialize trigger support
## 4 created endpoint for mach service 'com.apple.private.opendirectoryd.rpc'
## 5 set default handler for RPC 'reset_cache'
## 6 set default handler for RPC 'reset_statistics'
## nice.date lub.time
## 1 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 2 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 3 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 4 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 5 Oct 18 2014 11:28:23 2014-10-18 11:28:23
## 6 Oct 18 2014 11:28:23 2014-10-18 11:28:23

The first question I had was how long the installation process took

install.time <- yos.log[nrow(yos.log), "lub.time"] - yos.log[1, "lub.time"]
(as.duration(install.time))

## [1] "1848s (~30.8 minutes)"

Ok, about a half-hour.

Let's make a column for cumulative time by subtracting the start time from each row's time.

yos.log$cumulative <- yos.log$lub.time - min(yos.log$lub.time, na.rm=TRUE)

In order to see what processes were taking the longest, we have to make a column for elapsed time. To do this, we can subtract each row's time from the time of the subsequent row.

yos.log$elapsed <- lead(yos.log$lub.time) - yos.log$lub.time

# remove last row
yos.log <- yos.log[-nrow(yos.log),]

Which services were responsible for the most writes to the log, and which services took the longest? We can find out with the following elegant dplyr construct. While we're at it, we should add columns for percentage of the whole for easy plotting.

counts <- yos.log %>%
  group_by(Service) %>%
  summarise(n=n(), totalTime=sum(elapsed)) %>%
  arrange(desc(n)) %>%
  top_n(8, n) %>%
  mutate(percent.n = n/sum(n)) %>%
  mutate(percent.totalTime = as.numeric(totalTime)/sum(as.numeric(totalTime)))
(counts)

## Source: local data frame [8 x 5]
##
## Service n totalTime percent.n percent.totalTime
## 1 OSInstaller 42400 1586 secs 0.9197197 0.867615
## 2 opendirectoryd 3263 43 secs 0.0707794 0.023523
## 3 Unknown 236 157 secs 0.0051192 0.085886
## 4 _mdnsresponder 52 17 secs 0.0011280 0.009300
## 5 OS 49 1 secs 0.0010629 0.000547
## 6 diskmanagementd 47 7 secs 0.0010195 0.003829
## 7 storagekitd 29 2 secs 0.0006291 0.001094
## 8 configd 25 15 secs 0.0005423 0.008206

Ok, the "OSInstaller" is responsible for the vast majority of the writes to the log and of the total time of the installation. "opendirectoryd" was the next most verbose process, but its operations were relatively quick compared to the "Unknown" process's, as evidenced by "Unknown" taking almost 4 times longer in aggregate, in spite of having only 7% of "opendirectoryd"'s log entries.

We can more intuitively view the number-of-entries/time-taken mismatch thusly:

melted <- melt(as.data.frame(counts[,c("Service", "percent.n", "percent.totalTime")]))

ggplot(melted, aes(x=Service, y=as.numeric(value), fill=factor(variable))) +
  geom_bar(width=.8, stat="identity", position="dodge") +
  ggtitle("Breakdown of services during installation by writes to log") +
  ylab("percent") + xlab("service") +
  scale_fill_discrete(name="Percent of",
                      breaks=c("percent.n", "percent.totalTime"),
                      labels=c("writes to logfile", "time elapsed"))

As you can see, the "Unknown" process took a disproportionately long time for its relatively few log entries; the opposite behavior is observed with "opendirectoryd". The other processes contribute very little to both the number of log entries and the total time in the installation process.

What were the 5 most lengthy processes?

yos.log %>%
  arrange(desc(elapsed)) %>%
  select(Service, Message, elapsed) %>%
  head(n=5)

## Service
## 1 OSInstaller
## 2 OSInstaller
## 3 Unknown
## 4 OSInstaller
## 5 OSInstaller
## Message
## 1 PackageKit: Extracting file:///System/Installation/Packages/Essentials.pkg (destination=/Volumes/Macintosh HD/.OSInstallSandboxPath/Root, uid=0)
## 2 System Reaper: Archiving previous system logs to /Volumes/Macintosh HD/private/var/db/PreviousSystemLogs.cpgz
## 3 kext file:///Volumes/Macintosh%20HD/System/Library/Extensions/JMicronATA.kext/ is in hash exception list, allowing to load
## 4 Folder Manager is being asked to create a folder (down) while running as uid 0
## 5 Checking catalog hierarchy.
## elapsed
## 1 169 secs
## 2 149 secs
## 3 70 secs
## 4 46 secs
## 5 44 secs

The top processes were:

  • Unpacking and moving the contents of "Essentials.pkg" into what is to become the new system directory structure. This ostensibly contains items like all the updated applications (Safari, Mail, etc.). (almost three minutes)
  • Archiving the old system logs (two and a half minutes)
  • Loading the kernel module that allows the onboard serial ATA controller to work (a little over a minute)

Let's view a density plot of the number of writes to the log file during installation.

ggplot(yos.log, aes(x=lub.time)) +
  geom_density(adjust=3, fill="#0072B2") +
  ggtitle("Density plot of number of writes to log file during installation") +
  xlab("time") + ylab("")

This graph is very illuminating; the vast majority of log file writes were the result of very quick processes that took place in the last 15 minutes of the install, which is when the progress bar read that only two minutes were remaining.

In particular, there were a very large number of log file writes between 11:47 and 11:48; what was going on here?

# if the first time is in between the second two, this returns TRUE
is.in <- function(time, start, end){
  if(time > start && time < end)
    return(TRUE)
  return(FALSE)
}

the.start <- ymd_hms("14-10-18 11:47:00", tz="EST")
the.end   <- ymd_hms("14-10-18 11:48:00", tz="EST")

# logical vector containing indices of writes in time interval
is.in.interval <- sapply(yos.log$lub.time, is.in, the.start, the.end)

# extract only these rows
in.interval <- yos.log[is.in.interval, ]

# what do they look like?
silence <- in.interval %>%
  select(Message) %>%
  sample_n(7) %>%
  apply(1, function(x){ cat("\n"); cat(x); cat("\n") })

##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/tlpkg/tlpobj/featpost.tlpobj -> /Volumes/Macintosh HD/usr/local/texlive/2013/tlpkg/tlpobj Final name: featpost.tlpobj (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl/pst-eucl.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/generic/pst-eucl Final name: pst-eucl.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests/test_groupby.py -> /Volumes/Macintosh HD/Library/Python/2.7/site-packages/pandas-0.12.0_943_gaef5061-py2.7-macosx-10.9-intel.egg/pandas/tests Final name: test_groupby.py (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis/uct10.clo -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/tex/latex/ucthesis Final name: uct10.clo (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book/wkmgr1.tex -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/przechlewski-book Final name: wkmgr1.tex (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)
##
## WARNING : ensureParentPathExists: Created `/Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/doc/latex/moderntimeline' w/ {
##
## (NodeOp) Move /Volumes/Macintosh HD/Recovered Items/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj/mrjkx.pfb -> /Volumes/Macintosh HD/usr/local/texlive/2013/texmf-dist/fonts/type1/wadalab/mrj Final name: mrjkx.pfb (Flags used: kFSFileOperationDefaultOptions,kFSFileOperationSkipSourcePermissionErrors,kFSFileOperationCopyExactPermissions,kFSFileOperationSkipPreflight,k_FSFileOperationSuppressConversionCopy)

Ah, so these processes are the result of the installer having to move files back into the new installation directory structure. In particular, the vast majority of these move operations are moving files related to a program called "texlive". I'll explain why this is to blame for the inaccurate projected time to completion in the next section.

But lastly, let's view a faceted density plot of the number of log file writes by process. This might give us a sense of what steps go on as the installation progresses by showing us which processes are most active.

# reduce the number of services to a select few of the most active
smaller <- yos.log %>%
  filter(Service %in% c("OSInstaller", "opendirectoryd", "Unknown", "OS"))

ggplot(smaller, aes(x=lub.time, color=Service)) +
  geom_density(aes(y = ..scaled..)) +
  ggtitle("Faceted density of log file writes by process (scaled)") +
  xlab("time") + ylab("")

This shows that no one process runs consistently throughout the entire installation, but rather that the processes run in spurts.

The answer
The vast majority of Mac users don't place strange files in certain special system-critical locations like '/usr/local/' and '/Library/'. Among those who do, though, these directories are littered with hundreds and hundreds of custom files that the installer doesn't and can't have prior knowledge of.

In my case, and probably many others, the estimated time-to-completion was inaccurate because the installer couldn't anticipate needing to copy back so many files to certain special directories after unpacking the contents of the new OS. Additionally, for each of these copied files, the installer had to make sure the subdirectories had the exact same meta-data (permissions, owner, reference count, creation date, etc…) as before the installation began. This entire process added many minutes to the procedure at a point when the installer thought it was pretty much done.

What were some of the files that the installer needed to copy back? The answer will be different for each system but, as mentioned above, anything placed in the '/usr/local' and '/Library' directories that wasn't Apple-supplied needed to be moved aside and moved back.

/usr/local/ is used chiefly for user-installed software that isn't part of the OS distribution. In my case, my /usr/local contained a custom-compiled Vim; ClamXAV, a lightweight virus scanner that I use only for the benefit of my Windows-using friends; and texlive, software for the TeX typesetting system. texlive was, by far, the biggest time-sink since it had over 123,491 files.

In addition to these programs, many users might find that the Homebrew package manager is to blame for their long installation process, since this software also uses the /usr/local prefix (although it probably should not).

Among other things, this directory holds (subdirectories that hold) modules and packages that the Apple-supplied Python, Ruby, and Perl installations use. If you use these Apple-supplied versions of these languages and you install your own packages/modules using super-user privileges, the new packages will go into this directory and it will appear foreign to the Yosemite installer.

To get around this issue, either install packages/modules in a local (non-system) library, or use alternate versions of these programming languages that you either download and install yourself, or use MacPorts to install.
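One way to do the former in R is to keep a personal package library ahead of the system paths. A minimal sketch of this idea (the directory location below is illustrative; in practice you might use something like "~/Rlibs"):

```r
# Keep user-installed packages out of the system tree by putting a
# personal library first on the search path.
# (Directory is illustrative; a real setup would use e.g. "~/Rlibs".)
user_lib <- file.path(tempdir(), "Rlibs")
dir.create(user_lib, recursive = TRUE, showWarnings = FALSE)

# Prepend the personal library; install.packages() and library() will
# now prefer it over the system locations.
.libPaths(c(user_lib, .libPaths()))
# install.packages("somePackage")  # would now install into user_lib
```

With this in place, nothing package-related lands in the directories the OS installer has to shuffle around.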


You can find all the code and logs that I used for this analysis in this git repository.

This post is also available as an RMarkdown report here.


To leave a comment for the author, please follow the link and comment on his blog: On the lambda » ROn the lambda. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

simmer 2.0: a performance boost & revised syntax

Wed, 2014-10-22 16:57

(This article was first published on FishyOperations » R, and kindly contributed to R-bloggers)

It was very clear that the performance of the first simmer version was lagging behind. The 2.0 release introduces a C++ based engine and comes with a significant performance boost over the previous version. While the performance can always be improved further, I'm currently quite satisfied with it. Future refinement of the code base will, however, focus on further improving efficiency.

The syntax also had a massive overhaul. Most notably, a trajectory is no longer based on a data frame but is built up using the magrittr-style piping philosophy. The code, installation instructions and documentation can be found at the simmer GitHub repo.

As an example, a simple bank clerk / customer interaction process is simulated.

library(simmer)

bank_customer_trajectory <-
  create_trajectory("bank customer") %>%
  ## see the bank clerk
  add_seize_event("clerk", 1) %>%
  add_timeout_event("rpois(1, 20)") %>%
  add_release_event("clerk", 1) %>%
  ## with a 50% chance skip the next three steps
  add_skip_event("sample(c(0,3), 1)") %>%
  ## with a 50% chance see the bank director
  add_seize_event("director", 1) %>%
  add_timeout_event("rpois(1, 15)") %>%
  add_release_event("director", 1)

bank_process <-
  # create simulation with 50 replications & a max time of 360
  create_simulator("bank process", n = 50, until = 360) %>%
  add_resource("clerk", 1) %>%
  add_resource("director", 1) %>%
  # add 30 customers with a given inter-arrival time
  add_entities_with_interval(trajectory = bank_customer_trajectory,
                             n = 30,
                             name_prefix = "customer",
                             interval = "rpois(1, 20)") %>%
  # run the simulation
  simmer()

Once the simulation has finished, we can have a look at the results.

# have a look at the flow time
plot_evolution_entity_times(bank_process, type = "flow_time")

# have a look at the overall resource utilization
plot_resource_utilization(bank_process, c("clerk", "director"))

# have a look at the in use moments of the director for replication 5
plot_resource_usage(bank_process, "director", replication_n = 5)

As always, comments, questions, tips & suggestions are very welcome. Either post a GitHub issue or contact me by mail at bartsmeets86@gmail.com

The post simmer 2.0: a performance boost & revised syntax appeared first on FishyOperations.

To leave a comment for the author, please follow the link and comment on his blog: FishyOperations » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Positive Feedback in R with a Little Javascript

Wed, 2014-10-22 16:03

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

Let’s face it, sometimes the struggle in R can become frustrating, depressing, daunting, or just monotonous.  For those moments when you need a little positive feedback, some encouragement, or a pat on the back, I thought this might help.  Maybe I should make this into a package.

I found this via Sweet Alert for Bootstrap, forked from Tristan Edwards' non-bootstrap SweetAlert. This builds on the technique used in my previous post SVG + a little extra (d3.js) in RStudio Browser | No Pipes This Time.

# give yourself some positive feedback in R
# as you toil away on some difficult, but worthwhile task
# uses javascript sweet-alert https://github.com/t4t5/sweetalert
library(htmltools)
library(pipeR)

tagList(
  tags$script(
    '
    document.addEventListener("DOMContentLoaded", function(event) {
      swal("Good job! Brilliant!", "You\'re doing worthwhile things.", "success")
    });
    '
  )
) %>>%
  attachDependencies(
    htmlDependency(
      name = "sweet-alert"
      ,version = "0.2.1"
      ,src = c("href" = "http://timelyportfolio.github.io/sweetalert/lib")
      ,script = "sweet-alert.min.js"
      ,style = "sweet-alert.css"
    )
  ) %>>%
  html_print

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

New BayesFactor version 0.9.9 released to CRAN

Wed, 2014-10-22 15:01

(This article was first published on BayesFactor: Software for Bayesian inference, and kindly contributed to R-bloggers)

Today I submitted a new release of BayesFactor, version 0.9.9, to CRAN. Among the new features are support for contingency table analyses, via the function contingencyTableBF, and analysis of a single proportion, via the function proportionBF. Other features and fixes include:

  • Added "simple" argument to ttest.tstat, oneWayAOV.Fstat, and linearReg.R2stat; when TRUE, return only the Bayes factor (not the log BF and error)
  • When sampling Bayes factors, recompute() now increases the precision of BayesFactor objects, rather than simply recomputing them. Precision from new samples is added
  • Added Hraba and Grant (1970) data set; see ?raceDolls
  • Added model.matrix method for BayesFactor objects; allows for extracting the design matrix used for an analysis
  • recompute() now has multicore and callback support, as intended
  • Moved many backend functions to Rcpp from R C API
  • t test samplers now sample from interval null hypotheses and point null hypotheses where appropriate
  • fixed a bug in the meta t test sampler which wouldn't allow sampling small numbers of MCMC samples

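As a quick illustration of the two headline additions (the data values here are invented; see ?contingencyTableBF and ?proportionBF for the full argument lists):

```r
library(BayesFactor)

# Bayes factor for a 2x2 contingency table, treating row margins as fixed
# (counts are invented for illustration)
tab <- matrix(c(10, 5, 3, 12), nrow = 2)
bf_table <- contingencyTableBF(tab, sampleType = "indepMulti",
                               fixedMargin = "rows")

# Bayes factor for a single proportion: 17 successes in 20 trials,
# tested against a null proportion of 0.5
bf_prop <- proportionBF(y = 17, N = 20, p = 0.5)
```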
To leave a comment for the author, please follow the link and comment on his blog: BayesFactor: Software for Bayesian inference. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

translate2R: Migrate SPSS scripts to R at the push of a button

Wed, 2014-10-22 09:47

(This article was first published on eoda, R und Datenanalyse » eoda english R news, and kindly contributed to R-bloggers)

After their presentation at the international "useR!" conference, data analysis specialist eoda starts the public alpha testing of translate2R. With the start of alpha testing, the innovative migration solution by the company hailing from Kassel discards the working title "translateR" and takes on the final product name "translate2R". translate2R is a service for the automated translation of SPSS® syntax to R code, supporting data analysts with a quick and low-risk migration to R.

The manual translation of many, frequently rather complex SPSS scripts is often a tedious and error-prone task, and represents a rather large obstacle for many analysts and companies wanting to migrate to a modern, open source data management and analysis tool like R. With translate2R this hurdle will be diminished substantially. "Many companies have recognized the advantages of R. Still, old scripts with thousands of rows of code in other statistics languages pose a large barrier and an almost incalculable risk when migrating to R. translate2R will help limit this risk considerably", Oliver Bracht, CTO of eoda, explains the value of translate2R for users.

translate2R: user friendly and intuitive

translate2R, developed in a cooperation between eoda and the University of Kassel and sponsored by the LOEWE program of the German state of Hesse, will be provided both as a cloud service and on premise. For the user, the translation is processed in two steps: first, the SPSS syntax is imported into the user-friendly browser frontend of translate2R; then the corresponding R functions are generated at the push of a button. Finally, the generated code is executed in R. The accompanying R package translateSPSS2R contains all functions required for the code to run appropriately and is compatible with almost every installation of R. Besides migration projects to R, translate2R also supports programmers who are fluent in SPSS and are looking for a comfortable introduction to R. Vice versa, translate2R also enables users to comfortably implement common SPSS functions in R.

The user friendly browser frontend of translate2R.

translate2R taps into "the most powerful programming language for data analysis"

With the migration to R, almost innumerable opportunities for data analysis open up to users. The global community of developers is a tireless motor of steady progress and enables a level of functionality, quality and currency for data mining and predictive analytics never seen before. R is open source and available free of any licensing costs. Moreover, through its modern programming concepts, R supports the creation of future-proof and maintainable scripts for a plethora of analytical tasks. After a successful start of translate2R with SPSS® syntax, an expansion to other statistics languages is planned.

Anyone interested in translate2R can use the free alpha testing to gain first insights in this groundbreaking migration solution.

To leave a comment for the author, please follow the link and comment on his blog: eoda, R und Datenanalyse » eoda english R news. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

How the MKL speeds up Revolution R Open

Wed, 2014-10-22 08:43

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

Last week we announced the availability of Revolution R Open, an enhanced distribution of R.  One of the enhancements is the inclusion of high performance linear algebra libraries, specifically the Intel MKL. This library significantly speeds up many statistical calculations, e.g. the matrix algebra that forms the basis of many statistical algorithms.

Several years ago, David Smith wrote a blog post about multithreaded R, where he explored the benefits of the MKL, in particular on Windows machines.

In this post I explore whether anything has changed. 

What is the MKL?

To best use the power available in today's machines, Revolution R Open is installed by default with the Intel Math Kernel Library (MKL), which provides the BLAS and LAPACK library functions used by R. The Intel MKL makes it possible for many common R operations to use all of the available processing power.

The MKL's default behavior is to use as many parallel threads as there are available cores. There’s nothing you need to do to benefit from this performance improvement — not a single change to your R script is required.

However, you can still control or restrict the number of threads using the setMKLthreads() function from the Revobase package delivered with Revolution R Open. For example, you might want to limit the number of threads to reserve some of the processing capacity for other activities, or if you’re doing explicit parallel programming with the ParallelR suite or other parallel programming tools.

You can set the maximum number of threads as follows:

setMKLthreads(<value>)

where <value> is the maximum number of parallel threads, not to exceed the number of available cores.

Testing the MKL on matrix operations

Compared to open source R, the MKL offers significant performance gains, particularly on Windows.

Here are the results of 5 tests on matrix operations, run on a Samsung laptop with an Intel i7 4-core CPU. From the graphic you can see that a matrix multiplication runs 27 times faster with the MKL than without, and linear discriminant analysis is 3.6 times faster.

You can replicate the same tests by using this code:

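The benchmark code itself isn't reproduced in this excerpt; a minimal sketch of the kind of matrix-operation timing involved (the matrix size is illustrative) would be:

```r
# Time a large matrix multiplication, the operation that benefits most
# from the multithreaded MKL BLAS (size chosen for illustration only)
set.seed(42)
n <- 1000
A <- matrix(rnorm(n * n), nrow = n)
B <- matrix(rnorm(n * n), nrow = n)

mm_elapsed <- system.time(C <- A %*% B)["elapsed"]
mm_elapsed
```

Running the same timing under plain R and under RRO with the MKL gives the kind of comparison shown in the graphic.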


Simon Urbanek's benchmark

Another famous benchmark was published by Simon Urbanek, one of the members of R-core.  You can find his code at Simon's benchmark page.  His benchmark consists of three different classes of test:

  • Matrix calculation
    • This includes tests for creation of vectors, sorting, computing the cross product and linear regression.
    • In this category, the MKL substantially speeds up calculation of cross product (~26x) and linear regression (~20x)
  • Matrix functions
    • This category includes tests for computations that heavily involve matrix manipulation, e.g. fast Fourier transforms, computing eigenvalues and the matrix inverse.
    • On some of these, the MKL really shines, notably for Cholesky decomposition (~16x).
  • Programmation
    • The final category includes tasks such as looping and recursion
    • These functions do not activate any mathematical functionality, and thus the MKL makes no difference at all

I compared the total execution time of the benchmark script in RRO (with MKL) and R.  Using Revolution R Open, the benchmark tests completed in 47.7 seconds.  This compared to ~176 seconds using R-3.1.1 on the same machine.

To replicate these results you can use the following script, which runs (sources) his code directly from the URL and captures the total execution time:

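A sketch of such a script, assuming the benchmark file is still hosted at the URL below (treat the location as an assumption; check Simon's benchmark page for the current one):

```r
# Source Simon Urbanek's R-benchmark-25.R straight from the web and
# capture the total wall-clock time (URL is an assumption; it may move)
bench_url <- "http://r.research.att.com/benchmarks/R-benchmark-25.R"
total_elapsed <- system.time(source(bench_url))["elapsed"]
cat("Total benchmark time:", total_elapsed, "seconds\n")
```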


Detailed results

Here is a summary of each of the individual tests:

                                          R-3.1.1     RRO   Performance gain
  I. Matrix calculation
    Create, transpose and deform matrix      1.01    1.01    0.0
    Matrix computation                       0.40    0.40    0.0
    Sort random values                       0.72    0.74    0.0
    Cross product                           11.50    0.42   26.4
    Linear regression                        5.56    0.25   20.9

  II. Matrix functions
    Fast Fourier Transform                   0.45    0.47    0.0
    Compute eigenvalues                      0.74    0.39    0.9
    Calculate determinant                    2.87    0.24   10.8
    Cholesky decomposition                   4.50    0.25   16.8
    Matrix inverse                           2.71    0.25    9.9

  III. Programmation
    Vector calculation                       0.67    0.67    0.0
    Matrix calculation                       0.26    0.26    0.0
    Recursion                                0.95    1.06   -0.1
    Loops                                    0.43    0.43    0.0
    Mixed control flow                       0.41    0.37    0.1

  Total test time                          165.60   47.72    2.5

Conclusion and caveats

The Intel MKL makes a notable difference for many matrix computations.  When running the Urbanek benchmark using the MKL on Windows, you can expect a performance gain of ~2.5x.

The caveat is that the standard R distribution uses different math libraries on different operating systems. For example, R on Mac OS X uses the ATLAS BLAS, which gives you performance comparable to the MKL.

Download Revolution R Open

To find out more about Revolution R Open, go to http://mran.revolutionanalytics.com/open/

You can download RRO at http://mran.revolutionanalytics.com/download/


To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Prediction intervals too narrow

Wed, 2014-10-22 01:25

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

Almost all prediction intervals from time series models are too narrow. This is a well-known phenomenon and arises because they do not account for all sources of uncertainty. In my 2002 IJF paper, we measured the size of the problem by computing the actual coverage percentage of the prediction intervals on hold-out samples. We found that for ETS models, nominal 95% intervals may only provide coverage between 71% and 87%. The difference is due to missing sources of uncertainty.

There are at least four sources of uncertainty in forecasting using time series models:

  1. The random error term;
  2. The parameter estimates;
  3. The choice of model for the historical data;
  4. The continuation of the historical data generating process into the future.

When we produce prediction intervals for time series models, we generally only take into account the first of these sources of uncertainty. It would be possible to account for 2 and 3 using simulations, but that is almost never done because it would take too much time to compute. As computing speeds increase, it might become a viable approach in the future.

Even if we ignore the model uncertainty and the DGP uncertainty (sources 3 and 4), and just try to allow for parameter uncertainty as well as the random error term (sources 1 and 2), there are no closed form solutions apart from some simple special cases.

One such special case is an ARIMA(0,1,0) model with drift, which can be written as

    y_t = c + y_{t-1} + e_t

where e_t is a white noise process. In this case, it is easy to compute the uncertainty associated with the estimate of the drift c, and then allow for it in the forecasts.

This model can be fitted using either the Arima function or the rwf function from the forecast package for R. If the Arima function is used, the uncertainty in c is ignored, but if the rwf function is used, the uncertainty in c is included in the prediction intervals. The difference can be seen in the following simulated example.

library(forecast)

set.seed(22)
x <- ts(cumsum(rnorm(50, -2.5, 4)))

RWD.x <- rwf(x, h=40, drift=TRUE, level=95)
ARIMA.x <- Arima(x, c(0,1,0), include.drift=TRUE)

plot(forecast(ARIMA.x, h=40, level=95))
lines(RWD.x$lower, lty=2)
lines(RWD.x$upper, lty=2)

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

7 new R jobs (for October 21st 2014)

Tue, 2014-10-21 14:55

This is the bimonthly R Jobs post (for 2014-10-21), based on the R-bloggers’ sister website: R-users.com.

If you are an employer who is looking to hire people from the R community, please visit this link to post a new R job (it’s free, and registration takes less than 10 seconds).

If you are a job seeker, please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).


  1. Temporary
    Statistician for a six-month project contract in Milano, Italy.
    Quantide srl – Posted by nicola.sturaro
    Lombardy, Italy
    21 Oct 2014
  2. Full-Time
    Team leader Data Analysis of wind turbine data
    Nordex – Posted by Nordex
    Hamburg, Germany
    20 Oct 2014
  3. Freelance
    Data analysis R expert
    startup – Posted by Roberth
    Valencian Community, Spain
    20 Oct 2014
  4. Full-Time
    Modeller for projects in the hydroelectric and mining industries
    Ecocsape Environmental Consultants Ltd. – Posted by silentconquest
    British Columbia, Canada
    20 Oct 2014
  5. Full-Time
    BI Analyst for Risk & Finance
    Optimum Credit Ltd – Posted by Stephanie Locke
    Wales, United Kingdom
    20 Oct 2014
  6. Freelance
    Data genius, modeller, creative analyst
    Metavue – Posted by ninthtoe
    20 Oct 2014
  7. Full-Time
    Data Modeller for working with leading universities and Plc’s
    In Touch Limited – Posted by adunaic
    England, United Kingdom
    20 Oct 2014

Categories: Methodology Blogs

Example 2014.12: Changing repeated measures data from wide to narrow format

Tue, 2014-10-21 14:11

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

Data with repeated measures often come to us in the "wide" format, as shown below for the HELP data set we use in our book. Here we show just an ID, the CESD depression measure from four follow-up assessments, plus the baseline CESD.


Obs ID CESD1 CESD2 CESD3 CESD4 CESD

  1  1     7     .     8     5   49
  2  2    11     .     .     .   30
  3  3    14     .     .    49   39
  4  4    44     .     .    20   15
  5  5    26    27    15    28   39

Frequently for data analysis we need to convert the data to the "long" format, with a single column for the repeated time-varying CESD measures and a column indicating the time of measurement. This is needed, for example, in SAS proc mixed or in the lme4 package in R. The data should look something like this:

Obs ID time CESD cesd_tv

1 1 1 49 7
2 1 2 49 .
3 1 3 49 8
4 1 4 49 5

In section 2.3.7 (2nd Edition) we discuss this problem, and we provide an example in section 7.10.9. Today we're adding a blog post to demonstrate some handy features in SAS and how the problem can be approached using plain R and, alternatively, using the new-ish R packages dplyr and tidyr, contributed by Hadley Wickham.

We'll begin by making a narrower data frame with just the columns noted above. We use the select() function from the dplyr package to do this; the syntax is simply to provide the name of the input data frame as the first argument and then the names of the columns to be included in the output data frame. We use this function instead of the similar base R function subset(..., select=) because of dplyr's useful starts_with() function. This operates on the column names as character vectors in a hopefully obvious way.

wide = select(ds, id, starts_with("cesd"))

Now we'll convert to the long format. The standard R approach is to use the reshape() function. The documentation for this is a bit of a slog, and the function can generate error messages that are not so helpful. But for simple problems like this one, it works well.
long = reshape(wide, varying = c("cesd1", "cesd2", "cesd3", "cesd4"),
v.names = "cesd_tv",
idvar = c("id", "cesd"), direction="long")
long[long$id == 1,]

id cesd time cesd_tv
1.49.1 1 49 1 7
1.49.2 1 49 2 NA
1.49.3 1 49 3 8
1.49.4 1 49 4 5

In the preceding, the varying parameter is a list of the columns which vary over time, while the idvar columns appear at each time. The v.names parameter is the name of the column which will hold the values of the varying variables.

Another option would be to use base R knowledge to separate, rename, and then recombine the data as follows. The main hassle here is renaming the columns in each separate data frame so that they can be combined later.
c1 = subset(wide, select= c(id, cesd, cesd1))
c1$time = 1
names(c1)[3] = "cesd_tv"

c2 = subset(wide, select= c(id, cesd, cesd2))
c2$time = 2
names(c2)[3] = "cesd_tv"

c3 = subset(wide, select= c(id, cesd, cesd3))
c3$time = 3
names(c3)[3] = "cesd_tv"

c4 = subset(wide, select= c(id, cesd, cesd4))
c4$time = 4
names(c4)[3] = "cesd_tv"

long = rbind(c1,c2,c3,c4)

id cesd cesd_tv time
1 1 49 7 1
454 1 49 NA 2
907 1 49 8 3
1360 1 49 5 4

This is cumbersome, but effective.

More interesting is to use the tools provided by dplyr and tidyr.
gather(wide, key=names, value=cesd_tv, cesd1,cesd2,cesd3,cesd4) %>%
mutate(time = as.numeric(substr(names,5,5))) %>%
arrange(id,time) -> long


id cesd names cesd_tv time
1 1 49 cesd1 7 1
2 1 49 cesd2 NA 2
3 1 49 cesd3 8 3
4 1 49 cesd4 5 4
5 2 30 cesd1 11 1
6 2 30 cesd2 NA 2

The gather() function takes a data frame (the first argument) and returns new columns named in the key and value parameters. The contents of the columns are the names (in the key) and the values (in the value) of the former columns listed. The result is a new data frame with a row for every column in the original data frame, for every row in the original data frame. Any columns not named are repeated in the output data frame. The mutate() function is like the base R function transform() but has some additional features and may be faster in some settings. Finally, the arrange() function is a much more convenient sorting facility than is available in standard R. The input is a data frame and a list of columns to sort by, and the output is a sorted data frame. This saves us having to select out a subject to display.

The %>% operator is a "pipe" or "chain" operator that may be familiar if you're a *nix user. It feeds the result of the previous function into the next function as its first argument. This can cut down on the use of nested parentheses and may make R code easier to read for some folks. The effect of the piping is that mutate() should be read as taking the result of gather() as its input data frame and sending its output data frame into arrange(). For Ken, the right assignment arrow (-> long) makes sense as a way to finish off this chain of pipes, but Nick and many R users would prefer to write this as long = gather(...) or long <- gather(...), etc.
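As a minimal illustration: x %>% f(y) is just f(x, y), so a nested call can be unrolled left to right (the values here are arbitrary):

```r
library(magrittr)  # supplies %>%; dplyr loads it as well
nested <- round(mean(c(1.4, 2.6, 3.8)), 1)
piped  <- c(1.4, 2.6, 3.8) %>% mean() %>% round(1)
identical(nested, piped)  # TRUE: the two spellings are the same computation
```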

In SAS, we'll make the narrow data set using the keep statement in the data step, demonstrating along the way the convenient colon operator, which performs the same function as starts_with() in dplyr.

data all;
set "c:/book/help.sas7bdat";

data wide;
set all;
keep id cesd:;

The simpler way to make the desired data set is with the transpose procedure. Here the by statement forces the variables listed in it not to be transposed. The notsorted option saves us from having to actually sort the data. Otherwise the procedure works like gather(): each transposed variable becomes a row in the output data set for every observation in the input data set. SAS uses standard variable names for gather()'s key (SAS: _NAME_) and value (SAS: COL1), though these can be changed.

proc transpose data = wide out = long_a;
by notsorted id notsorted cesd;

data long;
set long_a;
time = substr(_name_, 5);
rename col1=cesd_tv;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;

Obs ID time CESD cesd_tv

1 1 1 49 7
2 1 2 49 .
3 1 3 49 8
4 1 4 49 5

As with R, it's trivial, though somewhat cumbersome, to generate this effect using basic coding.
data long;
set wide;
time = 1; cesd_tv = cesd1; output;
time = 2; cesd_tv = cesd2; output;
time = 3; cesd_tv = cesd3; output;
time = 4; cesd_tv = cesd4; output;

proc print data = long;
where id eq 1;
var id time cesd cesd_tv;

Obs ID time CESD cesd_tv

1 1 1 49 7
2 1 2 49 .
3 1 3 49 8
4 1 4 49 5

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs with similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least two other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit from this work under our license; if you see advertisements on this page, the aggregator is violating the terms under which we publish our work.

To leave a comment for the author, please follow the link and comment on his blog: SAS and R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Modeling Plenitude and Speciation by Jointly Segmenting Consumers and their Preferences

Tue, 2014-10-21 13:52

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

In 1993, when music was sold in retail stores, it may have been informative to ask about preference across a handful of music genres. Today, now that the consumer has seized control and the music industry has responded, the market has exploded into more than a thousand fragmented pairings of artists and their audiences. Grant McCracken, the cultural anthropologist, refers to such proliferation as speciation and the resulting commotion as plenitude. As with movies, genres become microgenres, forcing recommender systems to deal with more choices and narrower segments.
This mapping from the website Every Noise at Once is constantly changing. As the website explains, there is a generating algorithm with some additional adjustments to make it all readable, and it works as an enjoyable learning interface. You click on a label to play a music sample, and can then continue to a list of artists associated with that category and hear additional samples from each artist. Although the map seems to have interpretable dimensions and reflects similarity among the microgenres, it does not appear to be a statistical model in its present form.
At any given point in time, we are stepping into a dynamic process of artists searching for differentiation and social media seeking to create new communities who share at least some common preferences. Word of mouth is most effective when consumers expect new entries and when spreading the word is its own reward. It is no longer enough for a brand to have a good story if customers do not enjoy telling that story to others. Clearly, this process is common to all product categories even if they span a much smaller scale. Thus, we are looking for a scalable statistical model that captures the dynamics through which buyers and sellers come to a common understanding. 
Borrowing a form of matrix factorization from recommender systems, I have argued in previous posts for implementing this kind of joint clustering of the rows and columns of a data matrix as a replacement for traditional forms of market segmentation. We can try it with a music preference dataset from the R package prefmod. Since I intend to compare my findings with another analysis of the same 1993 music preference data, using the new R package RCA and reported in the American Journal of Sociology, we will begin by duplicating the few data modifications made in that paper (see the R code at the end of this post).
In previous attempts to account for music preferences, psychologists have focused on the individual and turned to personality theory for an explanation. For the sociologist, there is always the social network. As marketing researchers, we will add the invisible hand of the market. What is available? How do consumers learn about the product category and obtain recommendations? Where is it purchased? When and where is it consumed? Are others involved (public vs private consumption)?
The Internet opens new purchase pathways, encourages new entrants, increases choice, and transfers control to the consumer. The resulting postmodern market, with its plenitude of products, services, and features, cannot be contained within a handful of segments. Speciation and micro-segmentation demand a model that reflects the joint evolution in which new products and features are introduced to meet the needs of specific audiences, and consumers organize their attention around those microgenres. Nonnegative matrix factorization (NMF) represents this process with a single set of latent variables describing both the rows and the columns at the same time.
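As a self-contained illustration of that joint row/column description, here is a toy fit on random nonnegative data (not the music ratings); the nmf(), basis(), and coef() functions are from the NMF package used later in the post:

```r
library(NMF)
set.seed(42)
# toy nonnegative "ratings": 100 respondents x 17 genres
V <- matrix(runif(100 * 17), nrow = 100, ncol = 17)
fit <- nmf(V, rank = 5, method = "lee", nrun = 5)
W <- basis(fit)  # 100 x 5 mixing weights: one row per respondent (basismap)
H <- coef(fit)   # 5 x 17 loadings: one column per genre (coefmap)
dim(W)  # 100 5
dim(H)  # 5 17
```

The same five latent features score both the respondents (rows of W) and the genres (columns of H), which is what makes the segmentation "joint".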
After attaching the music dataset, NMF will produce a cluster heatmap summarizing the "loadings" of the 17 music genres (columns below) on the five latent features (rows below): Blues/Jazz, Heavy Metal/Rap, Country/Bluegrass, Opera/Classical, and Rock. The dendrogram at the top displays the results of a hierarchical clustering. Although there are five latent features, we could use the dendrogram to extract more than five genre clusters. For example, Big Band and Folk music seem to be grouped together, possibly as a link from classical to country. In addition, Gospel may play a unique role linking country and Blues/Jazz. Whatever we observe in the columns will need to be verified by examining the rows; that is, one might expect to find a segment drawn to country and jazz/blues who also like gospel.

We would have seen more of the lighter colors, with coefficients closer to zero, had we found greater separation. Yet this is not unexpected given the coarseness of the music genres. As we get more specific, the columns become increasingly separated by consumers who only listen to, or are aware of, a subset of the available alternatives. These finer distinctions define today's market for just about everything. In addition, the use of a liking scale forces us to recode missing values to a neutral liking. We would have preferred an intensity scale with missing values coded as zeros, because they indicate no interaction with the genre. Recoding missing to zero is not an issue when zero is the value given to "never heard of" or unaware.
Now, a joint segmentation means that listeners in the rows can be profiled using the same latent features accounting for covariation among the columns. Based on the above coefficient map, we expect those who like opera to also like classical music so that we do not require two separate scores for opera and classical but only one latent feature score. At least this is what we found with this data matrix. A second heatmap enables us to take a closer look at over 1500 respondents at the same time.

We already know how to interpret this heatmap because we have had practice with the coefficients. Here the colors indicate the values of the mixing weights for each respondent. Thus, in the middle of the heatmap you can find a dark red rectangle for latent feature #3, which we have already determined to represent country/bluegrass. These individuals give the lowest possible rating to everything except the genres loading on this latent feature. We do not observe much yellow or lighter color in this heatmap because less than 13% of the responses fell into the lowest box, labeled "dislike very much." However, most of the lighter regions are where you might expect them to be, for example, heavy metal/rap (#2), although we do uncover a heavy metal segment at the bottom of the figure.

Measuring Attraction and Ignoring Repulsion

We often think of liking as a bipolar scale, although what determines attraction can be different from what drives repulsion. Music is one of those product categories where satisfiers and dissatisfiers tend to differ. Negative responses can become extreme, so that preference is defined by what one dislikes rather than what one likes. In fact, being forced to listen to music we do not like may be responsible for the lowest scores (e.g., being dragged to the opera, or loud music from a nearby car). So, what would we find if we collapsed the bottom three categories and measured only attraction on a 3-point scale with 0 = neutral, dislike, or dislike very much, 1 = like, and 2 = like very much?
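Assuming the reversed 0-4 coding set up in the code comments at the end of the post (0 = lowest, 4 = like very much; the exact coding is an assumption), the collapse can be sketched in one line:

```r
# hypothetical reversed liking scores on the 0-4 scale
prefer <- matrix(c(0, 1, 2, 3, 4, 4, 2, 0), nrow = 2)
# bottom three boxes (0, 1, 2) -> 0; like (3) -> 1; like very much (4) -> 2
prefer2 <- pmax(prefer - 2, 0)
prefer2
```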

NMF thrives on sparsity, so increasing the number of zeros in the data matrix does not stress the computational algorithm. Indeed, the latent features become more separated, as we can see in the coefficient heatmap. Gospel stands alone as its own latent feature. Country and bluegrass remain, as do opera/classical, blues/jazz, and rock. When we "remove" dislike for heavy metal and rap, heavy metal moves into rock and rap floats with reggae between jazz and rock. The same is true for folk and easy mood music, only now both are attractive to country and classical listeners.

More importantly, we can now interpret the mixture weights for individual respondents as additive attractors, so that the first few rows are those with interest in all the musical genres. In addition, we can easily identify listeners with specific interests. As we continue to work our way down the heatmap, we find jazz/blues (#4), followed by rock (#5) and a combination of jazz and rock. Continuing, we see country (#2) plus rock, and country alone, after which is a variety of gospel (#1) plus some other genres. We end with opera and classical music, by itself and in combination with jazz.

Comparison with the Cultural Omnivore Hypothesis

As mentioned earlier, we can compare our findings to a published study testing whether inclusiveness rules tastes in music (the eclectic omnivore) or whether cultural distinctions between highbrow and lowbrow still dominate. Interestingly, the cluster analysis is approached as a graph-partitioning problem, where the affinity matrix is defined by similarity in the score pattern regardless of mean level. Not everyone agrees with this calculation, and we have a pair of dueling R packages using different definitions of similarity (RCA vs. CCA).

None of this is news for those of us who perform cluster analysis using the affinity propagation R package apcluster, which enables several different similarity measures including correlations (signed and unsigned). If you wish to learn more, I would suggest starting with the Orange County R User webinar for apcluster. The quality and breadth of the documentation will flatten your learning curve.

Both of the dueling R packages argue that preference similarity ought to be defined by the highs and lows in the score profiles ignoring the mean ratings for different individuals. This is a problem for marketing since consumers who do not like anything ought to be treated differently from consumers who like everything. One is a prime target and the other is probably not much of a user at all.

Actually, if I were interested in testing the cultural omnivore hypothesis, I would be better served by collecting familiarity data on a broader range of more specific music genres, perhaps not as detailed as the above map but more revealing than the current broad categories. The earliest signs of preference can be seen in what draws our attention. Recognition tends to be a less obtrusive measure than preference, and we can learn a great deal by knowing who visits each region in the music genre map and how long they stay.

NMF identifies sizable audiences, each familiar with the same subset of music genres. These are the latent features, the building blocks we have seen in the coefficient heatmaps. The lowbrow and the highbrow each confine themselves to separate latent features, residing in gated communities within the music genre map and knowing little of the other's world. The omnivore travels freely across these borders. Such class distinctions may be even more established in the cosmetics product category (e.g., women's makeup). Replacing genre with brand, you can read how this was handled in a prior post using NMF to analyze brand involvement.

R code to perform all the analyses reported in this post
library(prefmod)   # source of the music preference data
library(NMF)       # provides nmf(), coefmap(), basismap()
# keep only the 17 genres used
# in the AJS paper (see post)
# calculate the number of missing values for each
# respondent and keep only those with no more
# than 6 missing values
miss <- apply(prefer, 1, function(x) sum(is.na(x)))
# run frequency tables for all the variables
apply(prefer, 2, function(x) table(x, useNA = "always"))
# recode missing to the middle of the 5-point scale;
# reverse the scale so that larger values are
# associated with more liking and zero is
# the lowest value
# (longer names are easier to interpret)
fit <- nmf(prefer, 5, "lee", nrun = 30)
coefmap(fit, tracks = NA)
basismap(fit, tracks = NA)
# recode the bottom three boxes to zero
# and rerun NMF
# (first remove respondents with all zeros)
fit <- nmf(prefer2, 5, "lee", nrun = 30)
coefmap(fit, tracks = NA)
basismap(fit, tracks = NA)

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.


Kernel Density Estimation with Ripley’s Circumferential Correction

Tue, 2014-10-21 10:36

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

The revised version of the paper Kernel Density Estimation with Ripley’s Circumferential Correction with Ewen Gallic is now online, on hal.archives-ouvertes.fr/.

In this paper, we investigate (and extend) Ripley's circumference method to correct the bias of density estimation at the edges (or frontiers) of regions. The idea of the method was theoretical and difficult to implement. We provide a simple technique, based on properties of Gaussian kernels, to efficiently compute weights that correct border bias on the frontiers of the region of interest, with automatic selection of an optimal radius for the method. We illustrate the use of this technique to visualize hot spots of car accidents and campsite locations, as well as locations of bike thefts.

There are new applications, and new graphs, too.

Most of the codes can be found on https://github.com/ripleyCorr/Kernel_density_ripley (as well as datasets).

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


methylKit for bisulfite sequencing data analysis

Tue, 2014-10-21 05:48

(This article was first published on CHITKA, and kindly contributed to R-bloggers)

I have been relying on methylKit, an R package, for my RRBS data analysis. It is one of the most highly cited R packages for analysing bisulfite sequencing data. It is straightforward to install, and its vignette details all the major steps in bisulfite analysis with clarity. Altuna Akalin, the author of methylKit, has been actively supporting users (via Google Groups) with the issues they face in implementing methylKit. Overall, methylKit can be used with little knowledge of R. Interestingly, working with methylKit also helps laboratory researchers learn R.

As with other bisulfite sequencing data analysis packages, methylKit takes charge once the bisulfite reads have been aligned to the genome. Here are the tasks one can carry out with methylKit:

  • Extract methylation information from aligned data produced by mappers like Bismark
  • Alternatively, read the methylation information for mapped cytosines from other mappers like BSMAP, or from any other bisulfite mapper, in a specified format
  • Normalize the CpGs covered by removing CpGs that have excess coverage due to over-amplification/PCR duplication
  • Calculate the methylation status of each CpG covered (or of specified regions or tiles across the genome) and export it in bed or bedGraph formats for visualization in a genome browser; methylKit also enables merging of strand coverage
  • Handle replicates among control and test samples
  • Calculate differential methylation either at the single-CpG level or for regions/tiles covered across the control and test samples
  • Perform PCA and cluster analysis to identify the overall relations among the samples from a methylation point of view
  • Annotate differential methylation across CpG islands/shores and genic regions of interest such as promoters, exons, and introns
Any genomic analysis is highly customized after a certain number of basic steps. One has to build that customization by drawing on options from several packages and bridging the gaps by fine-tuning input and output file formats. methylKit does a fairly good job of facilitating the coercion of methylKit objects into GenomicRanges objects such as GRanges. This feature enables seamless integration with multiple Bioconductor packages. In future posts, I will detail some examples and R scripts that extend a methylKit analysis.
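Putting several of the steps above together, a minimal sketch of a methylKit session might look like the following. The file names, sample ids, genome assembly, and cutoff values are all hypothetical; the function names follow the package's vignette:

```r
library(methylKit)
# read aligned, methylation-called data for two control and two test samples
# (file names and sample ids are hypothetical)
obj <- read.bismark(location  = list("ctrl1.sam", "ctrl2.sam",
                                     "test1.sam", "test2.sam"),
                    sample.id = list("c1", "c2", "t1", "t2"),
                    assembly  = "hg19",
                    treatment = c(0, 0, 1, 1))
# drop low-coverage CpGs and coverage outliers from PCR duplication
filtered <- filterByCoverage(obj, lo.count = 10, hi.perc = 99.9)
meth <- unite(filtered)                 # keep CpGs covered in all samples
PCASamples(meth)                        # PCA of the samples
clusterSamples(meth, dist = "correlation", method = "ward")
diff <- calculateDiffMeth(meth)         # per-CpG differential methylation
dm   <- getMethylDiff(diff, difference = 25, qvalue = 0.01)
```

This is a sketch under those assumptions, not a drop-in pipeline; consult the vignette for the options appropriate to your data.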

To leave a comment for the author, please follow the link and comment on his blog: CHITKA.


Easy Clustered Standard Errors in R

Tue, 2014-10-21 00:27

(This article was first published on R for Public Health, and kindly contributed to R-bloggers)

To leave a comment for the author, please follow the link and comment on his blog: R for Public Health.
