R bloggers
URL Originality Analysis
(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)
By Chris Campbell, Senior Consultant.
When I first heard about Twitter, it was described to me as love, life and the story of the world, summarized in 140 characters or less. Each snippet of information is like a beautiful haiku, perfectly capturing the essence of events. A moment captured in amber.
Of course, now I know what Twitter is: perhaps less idyllic, but still interesting. Chris Musselle (who wrote the first blog post in this series) had been doing some investigation of what data science tools people are talking about. I was able to grab the data as a simple CSV.
A striking feature of the dataset that Chris captured using his Twitter-mining suite is that most tweets contain a URL. In fact, some tweets contain more than one URL. I wrote a simple function to list these, and then looked at the table of counts. More than 19,000 URLs were tweeted across 22,500 tweets.
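The extraction step can be sketched in a few lines of base R. The function and example tweets below are illustrative, not the original code:

```r
# Illustrative sketch (not the original function): pull URLs out of a
# vector of tweet texts and tabulate how often each one appears.
extract_urls <- function(text) {
  regmatches(text, gregexpr("https?://[^[:space:]]+", text))
}

tweets <- c("new post http://t.co/XQfmfy0wIR",
            "RT: worth a read http://t.co/XQfmfy0wIR",
            "no link in this one")
urls <- unlist(extract_urls(tweets))
sort(table(urls), decreasing = TRUE)
```

Tabulating the result over the whole dataset gives the URL-per-tweet counts discussed below.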
Rather than telling a story in 140 characters, 82% of tweets are cheating the character limit and using Twitter as advertising for rich media. So if the storytelling isn’t happening on Twitter, where is it happening? And what stories are being told? Are users tweeting links that they’ve found on Twitter, or are they sharing links which they have discovered elsewhere?
To determine the originality of a URL, we can estimate the distribution of novel posts and shares using Twitter’s automatic URL shortening feature. Many URLs are rather long, depending on the website file structure, and could easily exceed 140 characters. To allow URLs to be shared, all URLs posted in Twitter are automatically converted to short form. For example,
http://www.r-bloggers.com/am-i-a-data-scientist/
was updated to
http://t.co/XQfmfy0wIR
The host http://t.co will reroute requests to address XQfmfy0wIR to the rbloggers post. This then leaves space for user comment.
Of the 19,000 URLs:
 8,000 are unique short URLs (i.e. manual URL entry)
 The remaining 11,000 are retweeted URLs
 One short URL (http://projectsuperior.com) was tweeted over 200 times
 Five short URLs were tweeted more than 50 times
 The mean URL is shared 2.4 times
 The median URL is shared once
To discover where these short URLs are pointing to, we need to decode the URL to resolve the target destination. There are various tools in R for decoding short URLs; I used the decode_short_url function in the twitteR package. This function requests the URL from a web service and returns the long URL as a string. This can be slow for some sites, taking about a second per URL on average. In addition, some short URLs took several requests to resolve, and about 80 could not be resolved at all, perhaps because the target site had moved or been deleted.
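The retry logic can be sketched like this. The resolver is passed in as a function; in the post it would be twitteR's decode_short_url, but any short-to-long URL resolver works, which also makes the sketch testable offline:

```r
# Sketch of resolving a short URL with retries. `decoder` is any function
# mapping a short URL to its long form (e.g. twitteR::decode_short_url);
# a failed or empty lookup is retried up to `tries` times.
resolve_with_retry <- function(url, decoder, tries = 3) {
  for (i in seq_len(tries)) {
    long <- tryCatch(decoder(url), error = function(e) NULL)
    if (!is.null(long) && !is.na(long) && nzchar(long)) return(long)
  }
  NA_character_  # give up: unresolved after all attempts
}
```

Applying this over the 8,000 unique short URLs (and caching the results) avoids re-requesting duplicates.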
I used the data.table package with the cSplit function from the splitstackshape package to reshape the dataset by URL rather than by tweet. I then merged the table of decoded unique short URLs with the reshaped dataset by recoding the short codes with the factor function.
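The same one-row-per-URL reshape can be sketched in base R (the post itself used splitstackshape::cSplit with data.table); the toy column names here are assumptions:

```r
# Toy version of the reshape: from one row per tweet (with a space-separated
# list of URLs) to one row per URL occurrence.
tweets <- data.frame(id   = 1:3,
                     urls = c("http://t.co/a http://t.co/b",
                              "http://t.co/a",
                              "http://t.co/c"),
                     stringsAsFactors = FALSE)

per_url <- strsplit(tweets$urls, " ")
long <- data.frame(id  = rep(tweets$id, lengths(per_url)),
                   url = unlist(per_url),
                   stringsAsFactors = FALSE)
long
```

A tweet with two URLs contributes two rows, so counts by URL and counts by tweet can both be recovered from the long form.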
Of the 8,000 unique short URLs:
 6,400 are unique long URLs
 The remaining 1,200 are URLs that were tweeted on more than one unique occasion
 One URL (PayPal’s software blog) was tweeted on 65 unique occasions
 Six URLs were tweeted on more than 15 unique occasions
 The mean URL is tweeted on 1.2 occasions
 The median URL is tweeted once
This originality analysis identified different types of popular URL communication on Twitter.
Some URLs were very retweetable, but not discoverable.
 The http://projectsuperior.com URL was tweeted 328 times, using five unique short URLs.
 A ThoughtWorks blog post URL was tweeted 129 times, using two unique short URLs.
Some URLs were very discoverable.
 The PayPal blog post was discovered 65 times and tweeted 173 times. A third of these discoveries were retweeted.
 A presentation on net by Yuichi Ito was discovered 16 times and tweeted 33 times. A quarter of these discoveries were retweeted.
 A thread in the German language Entwicklerforum was linked on 29 unique occasions. In this case, the thread was being promoted by a single user, Entwickler himself (herself?). None of these discoveries were retweeted.
Popularity by volume definitely does not tell the whole story about how users are interacting with a URL. And discoverability on its own is insufficient to demonstrate interest, as Entwickler would perhaps admit.
The high-level view can take a little while to absorb from tables of links. I used a combination of tweet volume (word size) and URL discoverability (opacity) to display the truly interesting websites using the wordcloud package.
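That encoding can be sketched as follows; the site names and numbers are made up for illustration, and the wordcloud call is guarded so the snippet runs even without the package installed:

```r
# Word size from tweet volume, opacity (alpha channel) from discoverability
# scaled to [0, 1]. Site names and counts are illustrative only.
sites           <- c("paypal-engineering", "thoughtworks", "projectsuperior")
tweet_volume    <- c(173, 129, 328)
discoverability <- c(65, 2, 5) / 65

cols <- rgb(0, 0, 0, alpha = 0.2 + 0.8 * discoverability)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(sites, tweet_volume,
                       colors = cols, ordered.colors = TRUE)
}
```

Highly retweeted but rarely discovered sites then appear large and faint, while genuinely discoverable ones appear dark.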
This approach could be useful for prioritizing your lunchtime reading, separating the genuinely interesting from the spambots!
To read the first blog post in this series click here
To read the second blog post in this series click here
To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Illustrated Guide to ROC and AUC
(This article was first published on joy of data » R, and kindly contributed to R-bloggers)
(In a past job interview I failed at explaining how to calculate and interpret ROC curves – so here goes my attempt to fill this knowledge gap.) Think of a regression model mapping a number of features onto a real number (potentially a probability). The resulting real number can then be mapped onto one of two classes, depending on whether this predicted number is greater or lower than some choosable threshold. Let’s take for example a logistic regression and data on the survivorship of the Titanic accident to introduce the relevant concepts, which will lead naturally to the ROC (Receiver Operating Characteristic) and its AUC or AUROC (Area Under ROC Curve).
Titanic Data Set and the Logistic Regression Model
Every record in the data set represents a passenger – providing information on her/his age, gender, class, number of siblings/spouses aboard (sibsp), number of parents/children aboard (parch) and, of course, whether s/he survived the accident.
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/read_and_prepare_titanic_dataset.R
> df <- read_and_prepare_titanic_dataset("~/Downloads/titanic3.csv")
> str(df)
'data.frame': 1046 obs. of 6 variables:
 $ survived: Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num 29 0.92 2 30 25 48 63 39 53 71 ...
 $ sibsp   : int 0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int 0 2 2 2 2 0 0 0 0 0 ...

The logistic regression model is tested on batches of 10 cases with a model trained on the remaining N-10 cases – the test batches form a partition of the data. In short, leave-10-out CV has been applied to arrive at a more accurate estimation of the out-of-sample error rates.
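That scheme (fit on all but one batch, predict the held-out batch) can be sketched with plain glm; cv_predict below is a hypothetical stand-in, not the linked log_reg code:

```r
# Partition the rows into batches of `size`, fit a logistic regression on
# the complement of each batch, and predict on the held-out batch.
cv_predict <- function(df, size = 10) {
  batches <- split(seq_len(nrow(df)), ceiling(seq_len(nrow(df)) / size))
  pred <- numeric(nrow(df))
  for (batch in batches) {
    fit <- glm(survived ~ ., data = df[-batch, ], family = binomial)
    pred[batch] <- predict(fit, df[batch, ], type = "response")
  }
  pred
}
```

Every row thus gets an out-of-sample predicted probability, which is what the distribution plots below are based on.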
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/log_reg.R
> predictions <- log_reg(df, size=10)
> str(predictions)
'data.frame': 1046 obs. of 2 variables:
 $ survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
 $ pred    : num 0.114 0.854 0.176 0.117 0.524 ...

Distribution of the Predictions
Now let’s first have a look at the distribution of survival and death cases on the predicted survival probabilities.
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_pred_type_distribution.R
> plot_pred_type_distribution(predictions, 0.7)

If we consider survival as a positive (1) and death due to the accident as a negative (0) result, then the above plot illustrates the trade-off we face upon choosing a reasonable threshold: if we increase the threshold, the number of false positive (FP) results is lowered, while the number of false negative (FN) results increases.
Receiver Operating Characteristic
This question of how to balance false positives and false negatives (depending on the cost/consequences of either mistake) arose on a major scale during World War II in the context of interpreting radar signals for the identification of enemy air planes. For the purpose of visualizing and quantifying the impact of a threshold on the FP/FN trade-off, the ROC curve was introduced. The ROC curve is the interpolated curve made of points whose coordinates are functions of the threshold T:

ROC(T) = (FPR(T), TPR(T)), with FPR = FP / (FP + TN) and TPR = TP / (TP + FN)
In terms of hypothesis tests where rejecting the null hypothesis is considered a positive result the FPR (false positive rate) corresponds to the Type I error, the FNR (false negative rate) to the Type II error and (1 – FNR) to the power. So the ROC for above distribution of predictions would be:
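For a single threshold, these two coordinates can be computed directly. A minimal sketch, assuming a data frame shaped like predictions above (a 0/1 survived factor and a pred probability column):

```r
# FPR and TPR for one threshold: predicted positives are cases whose
# predicted probability reaches the threshold.
roc_point <- function(df, threshold) {
  pos      <- df$survived == "1"
  pred_pos <- df$pred >= threshold
  c(fpr = sum(pred_pos & !pos) / sum(!pos),
    tpr = sum(pred_pos &  pos) / sum(pos))
}

toy <- data.frame(survived = factor(c(0, 0, 1, 1, 1)),
                  pred     = c(0.2, 0.8, 0.4, 0.6, 0.9))
roc_point(toy, 0.5)
```

Sweeping the threshold from 1 down to 0 and collecting these points traces out the ROC curve.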
# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/calculate_roc.R
roc <- calculate_roc(predictions, 1, 2, n = 100)

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_roc.R
plot_roc(roc, 0.7, 1, 2)

The dashed lines indicate the location of the (FPR, TPR) pair corresponding to a threshold of 0.7. Note that the lower left corner (0,0) is associated with a threshold of 1 and the top right corner (1,1) with a threshold of 0.
The cost function and the corresponding coloring of the ROC points illustrate that an optimal FPR and TPR combination is determined by the associated cost. Depending on the use case false negatives might be more costly than false positive or vice versa. Here I assumed a cost of 1 for FP cases and a cost of 2 for FN cases.
Area Under (ROC) Curve
The optimal point on the ROC curve is (FPR, TPR) = (0,1). No false positives and all true positives. So the closer we get there the better. The second essential observation is that the curve is by definition monotonically increasing.
This monotonicity can easily be checked by looking at the first plot and mentally pushing the threshold (red line) up and down: lowering the threshold can only add predicted positives, so both TPR and FPR can only grow or stay put. Furthermore, any reasonable model’s ROC is located above the identity line, as a point below it would imply a prediction performance worse than random (in that case, simply inverting the predicted classes would bring us to the sunny side of the ROC space).
All those features combined make it apparently reasonable to summarize the ROC into a single value by calculating the area of the convex shape below the ROC curve – this is the AUC. The closer the ROC gets to the optimal point of perfect prediction the closer the AUC gets to 1.
# AUC for the example
> library(pROC)
> auc(predictions$survived, predictions$pred)
Area under the curve: 0.8421

ROC and AUC for Comparison of Classifiers
Two main reasons make the ROC curve a potentially powerful metric for the comparison of different classifiers. One is that the resulting ROC is invariant against class skew of the applied data set – a data set featuring 60% positive labels will yield the same (statistically expected) ROC as a data set featuring 45% positive labels (though this will affect the cost associated with a given point of the ROC). The other is that the ROC is invariant against the evaluated score – which means that we could compare a model giving non-calibrated scores like a regular linear regression with a logistic regression or a random forest model whose scores can be considered as class probabilities.
The AUC furthermore offers interesting interpretations:
The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. [Fawcett]
[The AUC] also has various natural intuitive interpretations, one of which is that it is the average sensitivity of a classifier under the assumption that one is equally likely to choose any value of the specificity — under the assumption of a uniform distribution over specificity. [Hand]
As the ROC itself is variable with respect to a given data set it is necessary to average multiple ROCs derived from different data sets to arrive at a good estimation of a classifier’s true ROC function.
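Fawcett's ranking interpretation quoted above can be checked directly on small examples; a sketch:

```r
# The AUC as the probability that a randomly chosen positive instance
# scores higher than a randomly chosen negative one (ties count one half):
# average the pairwise comparisons over all positive/negative pairs.
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75 for this toy example
```

This is the same quantity as the normalized Wilcoxon-Mann-Whitney statistic, and it matches the area under the empirical ROC curve.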
Criticism of the AUC
It seems problematic, in the first place, to absolutely measure and compare the performance of classifiers with something as simple as a scalar between 0 and 1. The fundamental reason for this is that problem-specific cost functions break the assumption that points in the ROC space are homogeneous in that regard and hence comparable across classifiers. This non-uniformity of the cost function causes ambiguities when ROC curves of different classifiers cross, and on its own when the ROC curve is compressed into the AUC by means of integration over the false positive rate.
However, the AUC also has a much more serious deficiency, and one which appears not to have been previously recognised. This is that it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules. It is equivalent to saying that, using one classifier, misclassifying a class 1 point is p times as serious as misclassifying a class 0 point, but, using another classifier, misclassifying a class 1 point is P times as serious, where p ≠ P. This is nonsensical because the relative severities of different kinds of misclassifications of individual points is a property of the problem, not the classifiers which happen to have been chosen. [Hand]
David J. Hand gives a statistically profound reasoning for the dubiousness of the AUC.
Sources
[Fawcett]: “An introduction to ROC analysis” by Tom Fawcett
[Hand]: “Measuring classifier performance: a coherent alternative to the area under the ROC curve” by David J. Hand
(original article published on www.joyofdata.de)
To leave a comment for the author, please follow the link and comment on his blog: joy of data » R.
arXiv frenzy
(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)
In the past few days, there have been so many arXiv postings of interest—presumably the NIPS submission effect!—that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in the near future:
 arXiv:1506.06629: Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
 arXiv:1506.06285: The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson
 arXiv:1506.06268: Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
 arXiv:1506.06117: Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
 arXiv:1506.06101: Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
 arXiv:1506.05934: Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
 arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
 arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
 arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
 arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
 arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
 arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
 arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
 arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
 arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
 arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]
Filed under: R, Statistics, University life Tagged: arXiv, Bayesian statistics, MCMC, Monte Carlo Statistical Methods, Montréal, NIPS 2015, particle filter
To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.
The 2DHarmonograph In Shiny
(This article was first published on Ripples, and kindly contributed to R-bloggers)
If you wish to make an apple pie from scratch, you must first invent the universe (Carl Sagan)
I like Shiny and I can’t stop converting some of my previous experiments into apps: today it is the turn of the harmonograph. This is a simple application, since you can only press a button to generate a random harmonograph-simulated curve. I love how easy it is to create a nice interactive app to play with from existing code. The only trick in this case is to add a rerun button on the UI side and transfer the interaction to the server side using a simple if. Very easy. This is a screenshot of the application:
Press the button and you will get a new drawing. Most of them are nice scrawls and from time to time you will obtain beautiful shapely curves.
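Outside Shiny, the curve itself is just a pair of damped-sinusoid sums. A minimal standalone sketch with hand-picked parameters (the app draws them at random):

```r
# x(t) and y(t) are each the sum of two exponentially damped sine waves.
# f = frequencies, d = damping factors, p = phases; fixed here for
# reproducibility, random in the app.
harmonograph <- function(f, d, p, t = seq(1, 200, by = 0.01)) {
  x <- exp(-d[1]*t)*sin(t*f[1]+p[1]) + exp(-d[2]*t)*sin(t*f[2]+p[2])
  y <- exp(-d[3]*t)*sin(t*f[3]+p[3]) + exp(-d[4]*t)*sin(t*f[4]+p[4])
  data.frame(t = t, x = x, y = y)
}

curve <- harmonograph(f = c(2, 3, 2, 3), d = rep(0.005, 4), p = rep(0, 4))
plot(curve$x, curve$y, type = "l", xlab = "", ylab = "")
```

Because each damped sine stays within [-1, 1], the whole drawing fits inside a 2-by-2 box, which is why the app fixes its plot limits accordingly.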
And no more candy words: it is time to complain. I say to RStudio, with all due respect: you are very cruel. You let me deploy my previous app to your server but suspended it almost immediately for fifteen days due to “exceeded usage hours”. My only option is paying at least $440 per year to upgrade my current plan. I tasted the ambrosia for an extremely short time. RStudio: why don’t you launch a cheaper account? Why don’t you launch a free account with just one perpetually alive app at a time? Why don’t you increase the usage hours threshold? I can help you to calculate the return on investment of these scenarios.
Or, Why don’t you make me a gift for my next birthday? I promise to upload a new app per month to promote your stunning tool. Think about it and please let me know your conclusions.
Meanwhile I will run the app privately. This is the code to do it:
ui.R
# This is the user-interface definition of a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://www.rstudio.com/shiny/

library(shiny)

shinyUI(fluidPage(
  titlePanel("Mathematical Beauties: The Harmonograph"),
  sidebarLayout(
    sidebarPanel(
      #helpText(),
      # adding the new div tag to the sidebar
      tags$div(class="header", checked=NA,
               tags$p("A harmonograph is a mechanical apparatus that employs pendulums to create a geometric image. The drawings created typically are Lissajous curves, or related drawings of greater complexity. The devices, which began to appear in the mid-19th century and peaked in popularity in the 1890s, cannot be conclusively attributed to a single person, although Hugh Blackburn, a professor of mathematics at the University of Glasgow, is commonly believed to be the official inventor. A simple, so-called 'lateral' harmonograph uses two pendulums to control the movement of a pen relative to a drawing surface. One pendulum moves the pen back and forth along one axis and the other pendulum moves the drawing surface back and forth along a perpendicular axis. By varying the frequency and phase of the pendulums relative to one another, different patterns are created. Even a simple harmonograph as described can create ellipses, spirals, figure eights and other Lissajous figures (Source: Wikipedia)")),
      tags$div(class="header", checked=NA,
               HTML('<p>Click <a href="http://paulbourke.net/geometry/harmonograph/harmonograph3.html">here</a> to see an image of a real harmonograph</p>')),
      actionButton('rerun', 'Launch the harmonograph!')
    ),
    mainPanel(
      plotOutput("HarmPlot")
    )
  )
))

server.R
# This is the server logic for a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://www.rstudio.com/shiny/

library(shiny)

CreateDS = function() {
  f = jitter(sample(c(2, 3), 4, replace = TRUE))
  d = runif(4, 0, 1e-02)
  p = runif(4, 0, pi)
  xt = function(t) exp(-d[1]*t)*sin(t*f[1]+p[1]) + exp(-d[2]*t)*sin(t*f[2]+p[2])
  yt = function(t) exp(-d[3]*t)*sin(t*f[3]+p[3]) + exp(-d[4]*t)*sin(t*f[4]+p[4])
  t = seq(1, 200, by = .0005)
  data.frame(t = t, x = xt(t), y = yt(t))
}

shinyServer(function(input, output) {
  dat <- reactive({if (input$rerun) dat = CreateDS() else dat = CreateDS()})
  output$HarmPlot <- renderPlot({
    plot(rnorm(1000), xlim = c(-2, 2), ylim = c(-2, 2), type = "n")
    with(dat(), plot(x, y, type = "l", xlim = c(-2, 2), ylim = c(-2, 2),
                     xlab = "", ylab = "", xaxt = 'n', yaxt = 'n',
                     col = "gray10", bty = "n"))
  }, height = 650, width = 650)
})

To leave a comment for the author, please follow the link and comment on his blog: Ripples.
Next Kölner R User Meeting: Friday, 26 June 2015
(This article was first published on mages' blog, and kindly contributed to R-bloggers)
The next Cologne R user group meeting is scheduled for this Friday, 26 June 2015, and we have an exciting agenda with two talks followed by networking drinks.
 Data Science at the Command Line (Kirill Pomogajko)
 An Introduction to RStan and the Stan Modelling Language (Paul Viefers)
Please note: Our venue changed! We have outgrown the seminar room at the Institute of Sociology and are moving to Startplatz, a startup incubator venue: Im Mediapark 5, 50670 Köln
Drinks and Networking
The event will be followed by drinks (Kölsch!) and networking opportunities.
For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.
The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.
This post was originally published on mages’ blog.
To leave a comment for the author, please follow the link and comment on his blog: mages' blog.
Randomized Hobbit
(This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers)
@wrathematics pointed me to his ngram R package for constructing n-grams from text and simulating from them.
I’d recently grabbed the text of The Hobbit, and so I applied it to that text, with amusing results.
Here’s the code I used to grab the text.
library(XML)
stem <- "http://www.5novels.com/classics/u5688"
hobbit <- NULL
for(i in 1:74) {
  cat(i, "\n")
  if(i == 1) {
    url <- paste0(stem, ".html")
  } else {
    url <- paste0(stem, "_", i, ".html")
  }
  x <- htmlTreeParse(url, useInternalNodes = TRUE)
  xx <- xpathApply(x, "//p", xmlValue)
  hobbit <- c(hobbit, gsub("\r", "", xx[-length(xx)]))
  Sys.sleep(0.5)
}

Then calculate the n-grams with n=2.
library(ngram)
ng2 <- ngram(hobbit, n = 2)

Simulate some number of words with babble(). If you use the seed argument, you can get the result reproducibly.
babble(ng2, 48, seed = 53482175)

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes
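Under the hood, a 2-gram model is just a table of adjacent word pairs. A base-R sketch of the counting step (the ngram package does this, plus the sampling, far more efficiently):

```r
# Split text into words, then pair each word with its successor.
bigrams <- function(text) {
  w <- strsplit(tolower(text), "[^a-z']+")[[1]]
  w <- w[nzchar(w)]
  paste(head(w, -1), tail(w, -1))
}

sort(table(bigrams("in a hole in the ground there lived a hobbit")),
     decreasing = TRUE)
```

Babbling then amounts to repeatedly sampling a successor word in proportion to these counts.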
Update: @wrathematics suggested that I mix two texts, so here’s a bit from the Hobbit in the Hat (The Hobbit with 59× Cat in the Hat — upsampled to match lengths.) But there’s maybe not enough overlap between the two texts to get much of a mixture.
“I am Gandalf,” said the fish. This is no way at all!
already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.
To leave a comment for the author, please follow the link and comment on his blog: The stupidest thing... » R.
Looking for Preference in All the Wrong Places: Neuroscience Suggests Choice Model Misspecification
(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)
At its core, choice modeling is a utility estimating machine. Everything has a value reflected in the price that we are willing to pay in order to obtain it. Here are a collection of Smart Watches from a search of Google Shopping. You are free to click on any one, look for more, or opt out altogether and buy nothing.
Where is the utility? It is in the brand name, the price, the user ratings and any other feature that gets noticed. If you pick only the Apple Smartwatch at its lowest price, I conclude that brand and price have high utility. It is a somewhat circular definition: I know that you value the Apple brand because you choose it, and you pick the Apple watch because the brand has value. We seem to be willing to live with such circularity as long as utility measured in one setting can be generalized over occasions and conditions. However, context matters when modeling human judgment and choice, making generalization a difficult endeavor. Utility theory is upended when higher prices alter perceptions so that the same food tastes better when it costs more.
What does any of this have to do with neuroscience? Utility theory was never about brain functioning. Glimcher and Fehr make this point in their brief history of neuroeconomics. Choice modeling is an “as if” theory claiming only that decision makers behave as if they assigned values to features and selected the option with the optimal feature mix.
When the choice task has been reduced to a set of feature comparisons, as is the common practice in most choice modeling, the process seems to work, at least in the laboratory (i.e., care must be taken to mimic the purchase process and adjustment may be needed when making predictions about real-world market shares). Yet, does this describe what one does when looking at the above product display from Google Shopping? I might try to compare the ingredients listed on the back of two packages while shopping in the supermarket. However, most of us find this task quickly becomes too difficult as the number of features exceeds our short-term memory limits (paying attention is costly).
Incorporating Preference Construction
Neuroeconomics suggests how value is constructed on the fly in real-world choice tasks. Specifically, reinforcement learning is supported by multiple systems within the brain: “dual controllers” for both the association of features and rewards (model-free utilities) and the active evaluation of possible futures (model-based search). Daw, Niv and Dayan identified the two corresponding regions of the brain and summarized the supporting evidence back in 2005.
Features can become directly tied to value so that the reward is inferred immediately from the presence of the feature. Moreover, if we think of choice modeling only as the final stages when we are deciding among a small set of alternatives in a competitive consideration set, we might mistakenly conclude that utility maximization describes decision making. As in the movies, we may wish to “flashback” to the beginning of the purchase process to discover the path that ended at the choice point where features seem to dominate the representation.
Perception, action and utility are all entangled in the wild, as shown by the work of Gershman and Daw. Attention focuses on the most or least desirable features in the context of the goals we wish to achieve. We simulate the essentials of the consumption experience and ignore the rest. Retrospection is remembering the past, and prospection is experiencing the future. The steak garners greater utility sizzling than raw because it is easier to imagine the joy of eating it.
While the cognitive scientist wants to model the details of this process, the marketer will be satisfied learning enough to make the sale and keep the customer happy. In particular, marketing tries to learn what attracts attention, engages interest and consideration, generates desire and perceived need, and drives purchase while retaining customers (i.e., the AIDA model). These are the building blocks from which value is constructed.
Choice modeling, unfortunately, can identify the impact of features only within the confines of a single study, but it encounters difficulties attempting to generalize any effects beyond the data collected. Many of us are troubled that even relatively minor changes can alter the framing of the choice task or direct attention toward a previously unnoticed aspect (Attention and Reference Dependence).
The issue is not one of data collection or statistical analysis. The R package support.BWS will assist with the experimental design, and other R packages such as bayesm, RSGHB and ChoiceModelR will estimate the parameters of a hierarchical Bayes model. No, the difficulty stems from needing to present each respondent with multiple choice scenarios. Even if we limit the number of choice sets that any one individual will evaluate, we are still forced to simplify the task in order to show all the features for all the alternatives in the same choice set. In addition, multiple choice sets impose some demands for consistency so that a choice strategy that can be used over and over again without a lot of effort is preferred by respondents just to get through the questionnaire. On the other hand, costly information search is eliminated, and there is no anticipatory regret or rethinking one’s purchase since there is no actual transfer of money. In the end, our choice model is misspecified for two reasons: it does not include the variables that drive purchase in real markets and the reactive effects of the experimental arrangements create confounding effects that do not occur outside of the study.
Measuring the Forces Shaping Preference
Consumption is not random but structured by situational need and usage occasion. “How do you intend to use your Smartwatch?” is a good question to ask when you begin your shopping, although we will need to be specific because small differences in usage can make a large difference in what is purchased. To be clear, we are not looking for well-formed preferences, for instance, feature importance or contribution to purchase. Instead, we focus on attention, awareness and familiarity that might be precursors or early phases of preference formation. If you own an iPhone, you might never learn about Android Wear. What, if anything, can we learn from the apps on your Smartphone?
I have shown how the R package NMF for non-negative matrix factorization can uncover these building blocks. We might wish to think of NMF as a form of collaborative filtering, not unlike a recommender system that partitions users into cliques or communities and products into genres or types (e.g., sci-fi enthusiasts and the fantasy/thrillers they watch). An individual pattern of awareness and familiarity is not very helpful unless it is shared by a larger community with similar needs arising from common situations. Product markets evolve over time by appealing to segments willing to pay more for differentiated offerings. In turn, the new offerings solidify customer differences. This separation of products and users into segregated communities forms the scaffolding used to construct preferences, and this is where we should begin our research and statistical analysis.
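Since the argument leans on NMF recovering such communities, here is a minimal base-R sketch of the multiplicative-update rule that this kind of factorization uses; the toy users-by-apps matrix V, its size, and the rank k = 2 are invented for illustration (the NMF package provides complete, optimized implementations):

```r
# Sketch of NMF via Lee & Seung multiplicative updates, on a toy matrix.
# V is a made-up 20 x 6 "users by apps" familiarity matrix; k is the number
# of latent communities/genres the factorization is asked to find.
set.seed(42)
V <- matrix(runif(20 * 6), nrow = 20, ncol = 6)
k <- 2
W <- matrix(runif(20 * k), 20, k)   # user loadings (communities)
H <- matrix(runif(k * 6), k, 6)     # app profiles (genres)
for (iter in 1:200) {
  # These updates keep W and H non-negative and reduce reconstruction error
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}
err <- sum((V - W %*% H)^2)  # squared reconstruction error
```

Rows of H can then be read as "genres" and rows of W as community memberships, which is the scaffolding metaphor used above.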
To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
DeployR Data I/O
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
by Sean Wells, Senior Software Engineer, Microsoft and David Russell
DeployR exists to solve a number of fundamental R analytics integration problems faced by application developers. For example, have you ever wondered how you might execute an R script from within a Web-based dashboard, an enterprise middleware solution, or a mobile application? DeployR makes it very simple. In fact, DeployR makes it very simple for any application developed in any language to:
 Provision one or more dedicated R sessions on demand
 Execute R scripts on those R sessions
 Pass inputs (data, files, etc.) when executing those R scripts
 Retrieve outputs (data, files, plots, etc.) following the execution of those R scripts
DeployR offers these features and many more through a set of Analytics Web services. These Web service interfaces isolate the application developer and the application itself from R code, from the underlying R sessions, in fact from all the complexities typically associated with R integration. DeployR client libraries currently available in Java, JavaScript and .NET make integration for application developers a snap. As application developers, we can leave R coding and model building to the data scientists. Now we can also leave all aspects of R session management to the DeployR server. This frees us up to focus on simple integrations with DeployR services that deliver the phenomenal power of R directly within our applications.
But what we want to focus on in this post is the rich set of data inputs and data outputs that can be passed and retrieved when executing R scripts on DeployR. And rather than tell you how you can work with inputs and outputs, we think it's better to show you, which is why we have prepared extensive tutorials on DeployR Data I/O, with full source code and run instructions available on GitHub.
Taking just one example application from these tutorials, RepoFileInEncodedDataOut, let's briefly explore what's going on. The application generates the following console output:
CONFIGURATION Using endpoint http://localhost:7400/deployr
CONFIGURATION Using broker config [ PooledBrokerConfig ]
CONNECTION Established authenticated pooled broker [ RBroker ]
TASK INPUT Repository binary file input set on task [ PooledTaskOptions.preloadWorkspace ]
TASK OPTION DeployR-encoded R object request set on task [ PooledTaskOptions.routputs ]
EXECUTION Pooled task submitted to broker [ RTask ]
TASK RESULT Pooled task completed in 1833ms [ RTaskResult ]
TASK OUTPUT Retrieved DeployR-encoded R object output hip [ RDataFrame ]
TASK OUTPUT Retrieved DeployR-encoded R object output hipDim [ RNumericVector ]
TASK OUTPUT Retrieved DeployR-encoded R object hipDim value=[2719.0, 9.0]
TASK OUTPUT Retrieved DeployR-encoded R object output hipNames [ RStringVector ]
TASK OUTPUT Retrieved DeployR-encoded R object hipNames value=[HIP, Vmag, RA, DE, Plx, pmRA, pmDE, e_Plx, B.V]

This console output can be interpreted as follows:
 CONFIGURATION & CONNECTION – the application connects and provisions R sessions on DeployR
 TASK INPUT – the application identifies a repository-managed file for auto-loading prior to task execution
 TASK OPTION – the application identifies the set of R workspace objects to be returned following task execution
 EXECUTION – the application executes the task
 TASK OUTPUT – the application retrieves the set of R workspace objects returned following task execution
Take a look at the Java source code for this particular example application here. As with all examples, the source code is heavily documented to help you understand the implementation step-by-step.
These tutorials are currently written in Java but the capabilities and mechanisms demonstrated apply equally across all languages and platforms. Check out the tutorials and let us know what you think.
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
Editing Dockerfiles with Architect
(This article was first published on Open Analytics - Blog, and kindly contributed to R-bloggers)
Monday 22 June 2015 – 14:36
In the last two years, software development has taken a step forward with the advent of Docker. For the uninitiated, Docker is a tool that automates the deployment of applications by packaging them with their dependencies in a virtual container, eliminating the need for a full virtual machine.
Docker has streamlined the process of application development on Linux servers, and Open Analytics has streamlined the creation of Dockerfiles with Architect’s new Dockerfile Editor.
Installation
To install the Dockerfile Editor, first click Help > Install New Software
Next, add the URL http://dockereditor.openanalytics.eu/update as a new repository.
Finally, select the “Dockerfile Editor” and click next. NOTE: You must first deselect the “Group items by category” option.
Using the Dockerfile Editor
For basic instructions on using the Dockerfile Editor, please visit our guide, Getting Started with Dockerfile Editor.
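To give a flavor of the files the editor targets, here is a minimal, illustrative Dockerfile for an R environment. The rocker/r-base image comes from the Rocker project listed in the resources below; the system libraries, R packages and script name are placeholders chosen for this sketch, not part of the Architect tooling:

```dockerfile
# Illustrative only: a small R image built on the Rocker project's base image
FROM rocker/r-base

# System libraries that many R packages need at install time (example set)
RUN apt-get update && apt-get install -y --no-install-recommends \
        libcurl4-openssl-dev libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

# Example R packages; swap in whatever your analysis needs
RUN Rscript -e "install.packages(c('data.table', 'ggplot2'))"

# A hypothetical analysis script
COPY analysis.R /home/docker/analysis.R
CMD ["Rscript", "/home/docker/analysis.R"]
```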
Additional Resources:
 Dockerfile reference
 Rocker, R Configurations for Docker
 Docker Tooling for Eclipse
 Docker Tooling Wiki
To leave a comment for the author, please follow the link and comment on his blog: Open Analytics - Blog.
New Shiny cheat sheet and video tutorial
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
We’ve added two new tools that make it even easier to learn Shiny.
Video tutorial
The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:
 The architecture of a Shiny app
 A template for making apps quickly
 The basics of building Shiny apps
 How to add sliders, drop-down menus, buttons, and more to your apps
 How to share Shiny apps
 How to control reactions in your apps to
 update displays
 trigger code
 reduce computation
 delay reactions
 How to add design elements to your apps
 How to customize the layout of an app
 How to style your apps with CSS
Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.
Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!
Watch the new video tutorial here.
New cheat sheet
The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.
The cheat sheet replaces the previous cheat sheet, adding new sections on single-file apps, reactivity, CSS and more. The new sheet also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.
Get the new Shiny cheat sheet here.
To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.
Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years
(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)
The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.
I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my Github repository.
Here are some preliminary descriptive statistics:
The data shows some interesting trends:
 Stops had been increasing steadily from 2003 to 2012, but have been falling since then.
 The percentage of stopped persons who were black was consistently 3.5–6.5 times higher than the percentage of stopped persons who were white.
 The data indicates whether or not officers explained the reason for the stop to the stopped person. The data shows that police gave an explanation about 98–99% of the time. Of course, this involves a certain level of trust, since the data itself is recorded by police. There is no difference in this statistic across race and sex.
 The median age of stopped persons was 24. The distribution was roughly the same across race and sex.
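As a sketch of how a share ratio like the one above can be computed, here is a mock example; the column names (year, race) and the race codes ('B', 'W', 'O') are assumptions made for this illustration, not necessarily the variable names in the cleaned .RData file:

```r
# Mock stop-and-frisk records: share of stops by race, per year, and the
# black-to-white ratio of those shares. Data are invented for illustration.
stops <- data.frame(
  year = rep(c(2003, 2012), each = 100),
  race = c(rep("B", 55), rep("W", 10), rep("O", 35),   # mock 2003 stops
           rep("B", 53), rep("W", 9),  rep("O", 38))   # mock 2012 stops
)
tab <- prop.table(table(stops$year, stops$race), margin = 1)  # row-wise shares
ratio <- tab[, "B"] / tab[, "W"]  # per-year black-to-white share ratio
```

On the real data the same two lines of tabulation, with the actual variable names, give the year-by-year ratios discussed above.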
A few notes on the data:
 The raw data is saved as CSV files, one file for each year. However, the same variables are not tracked in each year. The .RData file on Github only contains select variables.
 The importing and cleaning code can take about 15 minutes to run.
 All stops in all years have coordinates marking the location of the stop; however, I’m still unable to make sense of them. I plan to publish another post with some spatial analyses.
The coding for this was particularly interesting because I had never used R to download ZIP files from the web. I reproduce this portion of the code below. It produces one dataset for each year from 2013 to 2014.
for(i in 2013:2014){
  temp <- tempfile()
  url <- paste("http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/", i, "_sqf_csv.zip", sep='')
  download.file(url, temp)
  assign(paste("d", i, sep=''), read.csv(unz(temp, paste(i, ".csv", sep=''))))
}
unlink(temp)

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.
12 new R jobs (2015-06-22) – from @California to @Nottingham
This is the bi-monthly post (for 2015-06-22) for new R Jobs from R-users.com.
Employers: visit this link to post a new R job to the R community (it’s free and quick).
Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

 Full-Time
 Data Scientist (@Nottingham)
Capital One – Posted by S.Kugadasan  Nottingham
England, United Kingdom  19 Jun 2015

 Full-Time
 Lead Data Scientist (@Nottingham)
Capital One – Posted by S.Kugadasan  Nottingham
England, United Kingdom  19 Jun 2015

 Full-Time
 Instructional Technologist (Quantitative Applications) @ReedCollege @Portland
Reed College – Posted by KBott  Portland
Oregon, United States  19 Jun 2015

 Part-Time
 Data Scientist Intern at eBay (@Netanya)
ebay Marketplace Israel – Posted by ebay  Netanya
Center District, Israel  18 Jun 2015

 Full-Time
 Statistician Intermediate at Rice University (@Houston)
Children’s Environmental Health Initiative – Posted by kafi@umich.edu  Houston
Texas, United States  18 Jun 2015

 Full-Time
 Data Scientist (@TelAviv)
ubimo – Posted by Tal  Tel Aviv-Yafo
Tel Aviv District, Israel  18 Jun 2015

 Full-Time
 Data Scientist/Statistician (@KfarSaba)
Haimke  Kefar Sava
Center District, Israel  17 Jun 2015

 Full-Time
 R Programmer / Data Analyst (@SanDiego)
University of California San Diego Dept. of Radiation Medicine – Posted by lmell  San Diego
California, United States  16 Jun 2015

 Full-Time
 Data Scientist (@Prague)
CGI – Posted by CGI  Prague
Hlavní město Praha, Czech Republic  16 Jun 2015

 Full-Time
 R Programmer (@Connecticut)
Paul Dai – Real Staffing – Posted by Paul Dai  Ridgefield
Connecticut, United States  11 Jun 2015

 Full-Time
 Data Scientist (@Holon)
Yoni Schamroth / Perion Networks – Posted by Yoni Schamroth  Holon
Center District, Israel  11 Jun 2015

 Full-Time
 Quantitative Research Assistant for predicting the effects of public policy
American Enterprise Institute – Posted by clairegaut  Anywhere
 10 Jun 2015
Trading Moving Averages with Less Whipsaws
(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)
Using a simple moving average to time markets has been a successful strategy over a very long period of time. Nothing to write home about, but it cuts the drawdown of a buy and hold roughly in half, sacrificing less than 1% of the CAGR in the process. In two words: simple yet effective. Here are the important numbers (using the S&P 500 index from 1994 to 2013, inclusive):
Statistic            SMA 200   Buy and Hold
CAGR                 6.51%     7.13%
Sharpe Ratio         0.56      0.37
Sortino Ratio        0.05      0.04
Pct in Market        70.14     99.94
Winning Pct          54.61%    53.84%
Avg Drawdown         2.13%     2.48%
Avg Annual Drawdown  9.12%     15.83%
Max Drawdown         28.28%    56.78%
Gain to Pain         0.71      0.45

The moving average system is better by far. First, we can leverage it 2:1 (thus bringing CAGR up to approximately 13%) and still our maximum drawdown is going to be comparable to that of buy and hold. Furthermore, it’s in the market only 70% of the time – less risk, and the returns can probably be further boosted by putting the money to work in alternative assets.
One problem with these types of systems is whipsaws. The system works well when the price stays away from the moving average. However, in close proximity, one may have to enter and exit in quick succession, losing money along the way.
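To see the problem in isolation, here is a self-contained toy example (synthetic prices, not the S&P 500) that counts how often a naive above/below-SMA signal flips when the price hovers around its moving average:

```r
# Toy illustration of whipsaws: a noisy random-walk series hovering near its
# 50-period SMA makes a naive long/flat signal flip repeatedly.
set.seed(1)
price <- 100 + cumsum(rnorm(500, sd = 0.5))                 # synthetic "price"
sma <- as.numeric(filter(price, rep(1/50, 50), sides = 1))  # 50-period SMA
signal <- ifelse(price > sma, 1, 0)                         # long when above the SMA
flips <- sum(diff(na.omit(signal)) != 0)                    # each flip is a trade
```

Each flip is a round of entries/exits; cushioning the exit line, as described next, is one way to cut this count.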
One way to address this issue is to use an alternative “line” to trigger the exits (or the entries, for that matter). It could be a percentage band, but that’s hardly universal. Better to bring volatility into the picture.
Let’s do something more robust as an illustration.
require(quantmod)
require(btutils) # For construct.indicator

# Download data
gspc <- getSymbols("^GSPC", from="1900-01-01", auto.assign=F)
# Compute the returns
rets <- ROC(Cl(gspc), type="discrete")
# Compute the adjusted returns (returns divided by their recent volatility)
adj.rets <- rets/sqrt(runSum(rets*rets,10)/9)
# The moving average
sma <- runMean(adj.rets, n=200)
# The standard deviation
stddev <- sqrt(runSum(adj.rets*adj.rets,200)/199)
# Long trades are entered when the average turns positive
upper.band <- 0.0
# Long trades are closed when the average return goes below -0.05 standard deviations
lower.band <- -0.05*stddev
# For the "uncushioned" version use
# lower.band <- 0
uu <- ifelse(sma > upper.band, 1, 0)
dd <- ifelse(sma < lower.band, 1, 0)
long.entries <- (uu == 1 & lag(uu) == 0)
long.exits <- (dd == 1 & lag(dd) == 0)
short.entries <- long.entries
short.exits <- long.entries
short.entries[] <- FALSE
short.exits[] <- FALSE
# Given entries and exits (both for long and short positions), this function builds
# the corresponding indicator. It's pure C++, so it's fast too.
ind <- btutils::construct.indicator(long.entries, long.exits, short.entries, short.exits)

First we apply the moving average not to the prices, but to the volatility-adjusted returns (an interpretation of David Varadi’s Error Adjusted Momentum). The “cushioned” system goes long when the average becomes positive, but to exit, it leaves some cushion below the moving average. What do we gain?
Statistic            Cushioned EA  EA      SMA 200  Buy and Hold
CAGR                 10.93%        9.11%   6.51%    7.13%
Sharpe Ratio         0.80          0.69    0.56     0.37
Sortino Ratio        0.07          0.06    0.05     0.04
Pct in Market        80.28%        76.29%  70.14%   99.94%
Winning Pct          55.21%        55.00%  54.61%   53.84%
Avg Drawdown         1.88%         2.05%   2.13%    2.48%
Avg Annual Drawdown  8.90%         9.26%   9.12%    15.83%
Max Drawdown         19.39%        19.57%  28.28%   56.78%
Gain to Pain         1.23          0.98    0.71     0.45

To compare apples to apples, we added two systems. The first one (dubbed EA) uses error-adjusted returns to compute the SMA, but enters and exits when the SMA crosses the 0 line. The “cushioned” version is the system implemented by the above code.
Even after taking into account that the new strategy stays longer in the market, there seems to be a slight improvement. But that’s not the point. The “cushioned” approach did 4 trades in total – it exited for the two bear markets. The non-cushioned approach had 80 trades. Mission accomplished, in other words.
The post Trading Moving Averages with Less Whipsaws appeared first on Quintuitive.
To leave a comment for the author, please follow the link and comment on his blog: Quintuitive » R.
Organize a walk around London with R
(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
The subtitle of this post could be “How to plot multiple elements on interactive web maps in R”.
In this experiment I will show how to include multiple elements in interactive maps created using both plotGoogleMaps and leafletR. To complete the work presented here you would need the following packages: sp, raster, plotGoogleMaps and leafletR.
I am going to use data from OpenStreetMap, which can be downloaded for free from this website: weogeo.com
In particular I downloaded the shapefile with the stores, the one with the tourist attractions, and the polyline shapefile with all the roads in London. I will assume that you want to spend a day or two walking around London, and for this you would need the location of some hotels, plus the locations of all the Greggs in the area for lunch. You need a web map with all these customized elements that you can take with you as you walk around the city; here is how to create it.
Once you have downloaded the shapefiles from weogeo.com you can open them and assign the correct projection with the following code:
stores <- shapefile("weogeo_j117529/data/shop_point.shp")
projection(stores) <- CRS("+init=epsg:3857")
roads <- shapefile("weogeo_j117529/data/route_line.shp")
projection(roads) <- CRS("+init=epsg:3857")
tourism <- shapefile("weogeo_j117529/data/tourism_point.shp")
projection(tourism) <- CRS("+init=epsg:3857")
To extract only the data we need for the map, we can use these lines:
Greggs <- stores[stores$NAME %in% c("Gregg's","greggs","Greggs"),]
Hotel <- tourism[tourism$TOURISM=="hotel",]
Hotel <- Hotel[sample(1:nrow(Hotel),10),]
Footpaths <- roads[roads$ROUTE=="foot",]
plotGoogleMaps
I created three objects: two are points (Greggs and Hotel) and the last is of class SpatialLinesDataFrame. We already saw how to plot Spatial objects with plotGoogleMaps; here the only difference is that we need to create several maps and then link them together.
Let’s take a look at the following code:
Hotel.google <- plotGoogleMaps(Hotel,iconMarker=rep("http://www.linguistics.ucsb.edu/projects/weal/images/hotel.png",nrow(Hotel)),mapTypeId="ROADMAP",add=T,flat=T,legend=F,layerName="Hotels",previousMap=Greggs.google)
plotGoogleMaps(Footpaths,col="dark green",mapTypeId="ROADMAP",filename="Multiple_Objects_GoogleMaps.html",legend=F,previousMap=Hotel.google,layerName="Footpaths",strokeWeight=2)
As you can see, I first create two objects using the same function, and then I call the same function again to draw and save the map. I link the three maps together using the options add=T and previousMap.
We need to be careful here though, because the use of the option add is different from the standard plot function. In plot, I call the first plot and then, if I want to add a second, I call the function again with add=T. Here this option needs to go in the first and second calls, not in the last. Basically, in this case we are telling R not to close the map, because later on we are going to add elements to it. In the last line we do not put add=T, thus telling R to go ahead and close the map.
Another important option is previousMap, which is used starting from the second plot to link the various elements. This option references the previous object, meaning that in Hotel.google I reference the map in Greggs.google, while in the last call I reference the previous map, Hotel.google, not the very first.
The zoom level, if you want to set it, goes only in the first plot.
Another thing I changed compared to the last example is the addition of custom icons, using the option iconMarker. This takes a vector of icons, not just one, with the same length as the Spatial object to be plotted. That is why I use the function rep: to create a vector with the same URL repeated a number of times equal to the length of the object.
The icon can be whatever image you like. You can find a collection of free icons from this website: http://kml4earth.appspot.com/icons.html
The result is the map below, available here: Multiple_Objects_GoogleMaps.html
leafletR
We can do the same thing using leafletR. We first need to create GeoJSON files for each element of the map using the following lines:
Greggs.geojson <- toGeoJSON(Greggs)
Hotel.geojson <- toGeoJSON(Hotel)
Footpaths.geojson <- toGeoJSON(Footpaths)
Now we need to set the style for each element. For this task we are going to use the function styleSingle, which defines a single style for all the elements of the GeoJSON. This differs from the map in a previous post, in which we used the function styleGrad to create graduated colors depending on certain features of the dataset.
We can change the icons of the elements in leafletR using the following code:
Hotel.style <- styleSingle(marker=c("lodging", "blue", "s"))
Footpaths.style <- styleSingle(col="darkred",lwd=4)
As you can see, we have the option marker, which takes a vector with the name of the icon, its color, and its size (“s” for small, “m” for medium, or “l” for large). The names of the icons can be found here: https://www.mapbox.com/maki/, where you have a series of icons; if you hover the mouse over them you will see some info, the very last of which is the name to use here. The style of the lines is set using the two options col and lwd (line width).
Then we can simply use the function leaflet to set the various elements and styles of the map:
leaflet(c(Greggs.geojson,Hotel.geojson,Footpaths.geojson),
  style=list(Greggs.style,Hotel.style,Footpaths.style),
  popup=list(c("NAME"),c("NAME"),c("OPERATOR")),
  base.map="osm")

The result is the image below and the map available here: http://www.fabioveronesi.net/Blog/map.html
To leave a comment for the author, please follow the link and comment on his blog: R tutorial for Spatial Statistics.
SAS PROC MCMC example 12 in R: Change point model
(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
I restarted working my way through the PROC MCMC examples. The SAS manual describes this example: Consider the data set from Bacon and Watts (1971), where y is the logarithm of the height of the stagnant surface layer and the covariate x is the logarithm of the flow rate of water.
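Spelled out, the change point model in this example is a two-segment linear regression whose pieces join at the change point (this matches the construction of yhat in the code at the bottom):

```latex
y_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad
\mu_i =
\begin{cases}
\alpha + \beta_1 (x_i - c_p), & x_i < c_p,\\
\alpha + \beta_2 (x_i - c_p), & x_i \ge c_p.
\end{cases}
```

Both branches equal \(\alpha\) at \(x_i = c_p\), so the fitted mean is continuous; the two slopes \(\beta_1\) and \(\beta_2\) describe the regimes on either side of the change point \(c_p\).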
It is a simple example, and it posed no problems at all for STAN and JAGS. LaplacesDemon, on the other hand, gave me some trouble; it took quite some effort to obtain samples that seemed well-behaved. I did not try MCMCpack, but noted that its function MCMCregressChange() uses a slightly different model. The section below shows the results first; the code is given at the bottom.
Previous posts in the series PROC MCMC examples programmed in R were: example 61.1: sampling from a known density; example 61.2: Box-Cox transformation; example 61.5: Poisson regression; example 61.6: Non-Linear Poisson Regression; example 61.7: Logistic Regression Random-Effects Model; and example 61.8: Nonlinear Poisson Regression Multilevel Random-Effects Model.
Data
Data are read as below.
observed <-
'1.12 -1.39 1.12 -1.39 0.99 -1.08 1.03 -1.08
0.92 -0.94 0.90 -0.80 0.81 -0.63 0.83 -0.63
0.65 -0.25 0.67 -0.25 0.60 -0.12 0.59 -0.12
0.51 0.01 0.44 0.11 0.43 0.11 0.43 0.11
0.33 0.25 0.30 0.25 0.25 0.34 0.24 0.34
0.13 0.44 -0.01 0.59 -0.13 0.70 -0.14 0.70
-0.30 0.85 -0.33 0.85 -0.46 0.99 -0.43 0.99
-0.65 1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
  what=list(y=double(),x=double()),
  sep=' ')
stagnant <- as.data.frame(observed)
LaplacesDemon
I have been playing around with LaplacesDemon. There is actually a function LaplacesDemon.hpc which can use multiple cores. However, on my computer it seems more efficient just to use mclapply() from the parallel package and give the result class demonoid.hpc. Having said that, I again had quite some trouble getting LaplacesDemon to work well. In the end I used a combination of two calls to LaplacesDemon. The plot below shows selected samples after the first run. Not good enough, but I do like this way of displaying the results of chains. It should be added that the plots looked similar for all parameters; however, showing them all would give too much output for this blog. After the second call the results looked acceptable.
Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = apply(cc1$Posterior1,
2, median), Covar = var(cc1$Posterior1), Iterations = 1e+05,
Algorithm = “RWM”)
Acceptance Rate: 0.2408
Algorithm: Random-Walk Metropolis
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
alpha beta1 beta2 cp s2
4.920676e-04 2.199525e-04 3.753738e-04 8.680339e-04 6.122862e-08
Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
All Stationary
Dbar -144.660 -144.660
pD 7.174 7.174
DIC -137.486 -137.486
Initial Values:
alpha beta1 beta2 cp s2
0.5467048515 -0.4100040451 -1.0194586232 0.0166405998 0.0004800931
Iterations: 4e+05
Log(Marginal Likelihood): 56.92606
Minutes of runtime: 1.32
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 5
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 0
Recommended Burn-In of Unthinned Samples: 0
Recommended Thinning: 5
Specs: (NOT SHOWN HERE)
Status is displayed every 100 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 40000
Thinning: 1
Summary of All Samples
Mean SD MCSE ESS LB
alpha 5.348239e-01 0.0244216567 2.999100e-04 11442.06 4.895229e-01
beta1 -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2 -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp 2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2 4.472644e-04 0.0001429674 1.383748e-06 16571.94 2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP 4.636511e+01 1.8939530321 2.470244e-02 10134.91 4.164313e+01
Median UB
alpha 0.53339024 5.842152e-01
beta1 -0.41996859 -3.903572e-01
beta2 -1.01387256 -9.815650e-01
cp 0.03110570 8.398674e-02
s2 0.00042101 8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP 46.76949458 4.888251e+01
Summary of Stationary Samples
Mean SD MCSE ESS LB
alpha 5.348239e-01 0.0244216567 2.999100e-04 11442.06 4.895229e-01
beta1 -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2 -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp 2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2 4.472644e-04 0.0001429674 1.383748e-06 16571.94 2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP 4.636511e+01 1.8939530321 2.470244e-02 10134.91 4.164313e+01
Median UB
alpha 0.53339024 5.842152e-01
beta1 -0.41996859 -3.903572e-01
beta2 -1.01387256 -9.815650e-01
cp 0.03110570 8.398674e-02
s2 0.00042101 8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP 46.76949458 4.888251e+01
STAN
Stan did not give much trouble for this analysis.
Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
postwarmup draws per chain=1000, total postwarmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
Beta[1] -0.42 0.00 0.01 -0.45 -0.43 -0.42 -0.41 -0.39 1017 1.00
Beta[2] -1.01 0.00 0.02 -1.05 -1.02 -1.01 -1.00 -0.98 1032 1.00
Alpha 0.54 0.00 0.03 0.49 0.52 0.53 0.55 0.59 680 1.00
s2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1361 1.00
cp 0.03 0.00 0.03 -0.04 0.00 0.03 0.05 0.09 636 1.00
lp__ 90.63 0.06 1.78 86.17 89.71 91.00 91.91 93.03 935 1.01
Samples were drawn using NUTS(diag_e) at Fri Jun 19 21:17:54 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
JAGS
Again no problems for Jags.
Inference for Bugs model at “/tmp/Rtmpy4a6C5/modeld4f6e9c9055.txt”, fit using jags,
4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
n.sims = 4000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat
alpha 0.534 0.027 0.479 0.518 0.533 0.552 0.586 1.040
beta[1] -0.420 0.015 -0.450 -0.429 -0.420 -0.410 -0.389 1.013
beta[2] -1.014 0.017 -1.049 -1.024 -1.014 -1.003 -0.980 1.023
cp 0.029 0.035 -0.038 0.006 0.032 0.051 0.100 1.037
s2 0.000 0.000 0.000 0.000 0.000 0.001 0.001 1.004
deviance -144.501 3.986 -149.634 -147.378 -145.432 -142.584 -134.378 1.021
n.eff
alpha 160
beta[1] 380
beta[2] 290
cp 160
s2 710
deviance 290
For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
DIC info (using the rule, pD = var(deviance)/2)
pD = 7.9 and DIC = -136.6
DIC is an estimate of expected predictive error (lower deviance is better).
CODE used
# http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_mcmc_examples18.htm
# Example 61.12 Change Point Models
##############
#Data
##############
observed <-
'1.12 -1.39 1.12 -1.39 0.99 -1.08 1.03 -1.08
0.92 -0.94 0.90 -0.80 0.81 -0.63 0.83 -0.63
0.65 -0.25 0.67 -0.25 0.60 -0.12 0.59 -0.12
0.51 0.01 0.44 0.11 0.43 0.11 0.43 0.11
0.33 0.25 0.30 0.25 0.25 0.34 0.24 0.34
0.13 0.44 -0.01 0.59 -0.13 0.70 -0.14 0.70
-0.30 0.85 -0.33 0.85 -0.46 0.99 -0.43 0.99
-0.65 1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
  what=list(y=double(),x=double()),
  sep=' ')
stagnant <- as.data.frame(observed)
#plot(y~x,data=stagnant)
##############
#LaplacesDemon
##############
library('LaplacesDemon')
library(parallel)
mon.names <- "LP"
parm.names <- c('alpha',paste('beta',1:2,sep=''),'cp','s2')
PGF <- function(Data) {
x <- c(rnorm(5,0,1))
x[4] <- runif(1,-1.3,1.1)
x[5] <- runif(1,0,2)
x
}
MyData <- list(mon.names=mon.names,
parm.names=parm.names,
PGF=PGF,
x=stagnant$x,
y=stagnant$y)
#N<1
Model <- function(parm, Data)
{
alpha=parm[1]
beta=parm[2:3]
cp=parm[4]
s2=parm[5]
yhat <- alpha+(Data$x-cp)*beta[1+(Data$x>=cp)]
LL <- sum(dnorm(Data$y,yhat,sd=sqrt(s2),log=TRUE))
prior <- sum(dnorm(parm[1:3],0,1e3,log=TRUE))+
dunif(cp,-1.3,1.1,log=TRUE)+
dunif(s2,0,5,log=TRUE)
LP=LL+prior
Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
yhat=yhat,
parm=parm)
return(Modelout)
}
Fit1 <- mclapply(1:4,function(i)
LaplacesDemon(Model,
Data=MyData,
Iterations=100000,
Algorithm='RWM',
Covar=c(rep(.01,4),.00001),
Initial.Values = c(.5,-.4,-1,.05,.001)) #Initial.Values
)
class(Fit1) <- 'demonoid.hpc'
plot(Fit1,Data=MyData,Parms=c('alpha'))
cc1 <- Combine(Fit1,MyData)
#
Fit2 <- mclapply(1:4,function(i)
LaplacesDemon(Model,
Data=MyData,
Iterations=100000,
Algorithm='RWM',
Covar=var(cc1$Posterior1),
Initial.Values = apply(cc1$Posterior1,2,median)) #Initial.Values
)
class(Fit2) <- 'demonoid.hpc'
#plot(Fit2,Data=MyData,Parms=c('alpha'))
cc2 <- Combine(Fit2,MyData)
cc2
##############
#STAN
##############
stanData <- list(
N=nrow(stagnant),
x=stagnant$x,
y=stagnant$y)
library(rstan)
smodel <- '
data {
int <lower=1> N;
vector[N] x;
vector[N] y;
}
parameters {
real Beta[2];
real Alpha;
real <lower=0> s2;
real <lower=-1.3,upper=1.1> cp;
}
transformed parameters {
vector[N] yhat;
for (i in 1:N)
yhat[i] <- Alpha+(x[i]-cp)*Beta[1+(x[i]>cp)];
}
model {
y ~ normal(yhat,sqrt(s2));
s2 ~ uniform(0,1e3);
cp ~ uniform(-1.3,1.1);
Alpha ~ normal(0,1000);
Beta ~ normal(0,1000);
}
'
fstan <- stan(model_code = smodel,
data = stanData,
pars=c('Beta','Alpha','s2','cp'))
fstan
##############
#Jags
##############
library(R2jags)
jagsdata <- list(
N=nrow(stagnant),
x=stagnant$x,
y=stagnant$y)
jagsm <- function() {
for(i in 1:N) {
yhat[i] <- alpha+(x[i]-cp)*beta[1+(x[i]>cp)]
y[i] ~ dnorm(yhat[i],tau)
}
tau <- 1/s2
s2 ~ dunif(0,1e3)
cp ~ dunif(-1.3,1.1)
alpha ~ dnorm(0,0.001)
beta[1] ~ dnorm(0,0.001)
beta[2] ~ dnorm(0,0.001)
}
params <- c('alpha','beta','s2','cp')
myj <- jags(model=jagsm,
data = jagsdata,
n.chains=4,
n.iter=10000,
parameters=params)
myj
To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
dimple charts for R
(This article was first published on FishyOperations, and kindly contributed to R-bloggers)
dimple is a simple-to-use charting API powered by D3.js.
Making use of the nice htmlwidgets package it only took a minimum amount of coding to make the dimple library available from R.
You can find the dimple R package at github.com/Bart6114/dimple and some documentation and examples at: bart6114.github.io/dimple (can take a while to load). Using the package you can create static javascript-based graphs, but you can also use dimple charts in Shiny applications.
To leave a comment for the author, please follow the link and comment on his blog: FishyOperations.
Interactive R Notebooks with Jupyter and SageMathCloud
(This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers)
The Jupyter Notebook interface allows users to seamlessly mix code, output, and markdown commentary all in the same document. Sound a bit like R Markdown? The difference is that in a Jupyter Notebook, you execute the code right in the browser and view everything in the same visual view.
Jupyter, formerly known as the IPython Notebook, has a rich ecosystem, especially among Python users. Science News recently featured Jupyter and its benefits for reproducible research and collaboration; Github, which automagically renders Jupyter notebook files on their website, has a gallery featuring a wide variety of cool static renderings of notebooks.
The name Jupyter (JUlia, PYThon, R) shows that the project has ambitions outside of its Python homeland. But so far, I’ve seen very little discussion in the R community, besides Tony Hirst’s recent musings on synergies between R Markdown and Jupyter. R basically works in Jupyter, though it has nothing like the support that knitr now has for shiny and a variety of other fancier interactive R features in R Markdown and Rnw files.
It’s still pretty cool to be able to see the output & change the code together in Jupyter, though. First, I recommend checking out the demo on the Jupyter web site. You can then click on the
“Welcome to R” demo and play around with the code:
To update the output from a compute cell, you can press CTRL + ENTER or simply press the ▶ on the upper toolbar. Note that Jupyter also has nice Tab-completion, menus, and a host of other features! You can also upload or start your own files.
But, this is just a temporary server. How can you bring Jupyter into your workflow? You can install Jupyter and the R Kernel locally on your computer, which is relatively straightforward on Linux but looks like a pain on any other OS.
Another way: there are a few commercial cloud services that host Jupyter notebooks, and the one I like right now is SageMathCloud. The Sage Math project has come a long way on its goals of developing open-source math and collaboration tools, and I see a host of fascinating features on this app, such as course management tools for classroom use, collaborative editing and live video chat alongside a project & file system for managing Jupyter notebooks and Sage Math worksheets, and the ability to open multiple notebooks in different tabs:
I have been startled both by the robustness and activity of the development of Jupyter, and the relatively little discussion from the R community. Let’s get into Jupyter and see what we can create, and how we can shape the future of this highly promising tool for collaborative, reproducible data science & statistics.
To leave a comment for the author, please follow the link and comment on his blog: Statisfactions: The Sounds of Data and Whimsy » R.
DO Something Nifffty with R
(This article was first published on rud.is » R, and kindly contributed to R-bloggers)
@briandconnelly (of pushoverr fame) made a super-cool post about connecting R to @IFTTT via IFTTT’s “Maker” channel. The IFTTT Maker interface to receive events is fairly straightforward and Brian’s code worked flawlessly, so it was easy to tweak a bit and wrap into a package.
To get started, you can clone an example public IFTTT recipe that posts data to a Dropbox folder. All you need to do is add the IFTTT channel to your account and add the API key to your .Renviron file, which will look something like:
IFTTT_API_KEY=uGlySTringOfCharACTers
Sticking it in the .Renviron file is a nice platform-independent way of getting the variable in your environment (and you can set a similar environment variable in Travis for automated testing). NOTE that you’ll need to restart your R/RStudio session for that value to become available (what, you don’t have RStudio up all the time?).
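Once the key is in place, you can sanity-check that R actually sees it; a minimal sketch (the variable name IFTTT_API_KEY matches the .Renviron entry above):

```r
# Sys.getenv() returns "" (not NA) when a variable is unset,
# so nzchar() makes a convenient "is it set?" test
api_key <- Sys.getenv("IFTTT_API_KEY")
if (!nzchar(api_key)) {
  message("IFTTT_API_KEY not set; add it to ~/.Renviron and restart R")
}
```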
Once you have the key in place and the IFTTT recipe setup, all you have to do is:
devtools::install_github("hrbrmstr/nifffty")
library(nifffty)
maker("rtest", "this", "is a", "test")
and you’ll have a file named something like june_19__2015_at_0927am.txt in a Dropbox folder that should have values like:
Value 1: this
Value 2: is a
Value 3: test
The potential IFTTT integrations are not limited to Dropbox (or iOS/Android notification as in Brian’s example). You can literally trigger anything from R in the IFTTT catalogue. Go foRth and automate!
DOing even MoRe
As stated, Brian’s code was complete, so making a package to post to IFTTT’s Maker channel was really easy. But, this is R and if we can get data to it, things can be even cooler. So, I made a receiver function that will catch Maker channel web requests and do stuff with them. Since I have an Apple Watch, I decided I wanted to test the interface out with IFTTT’s DO button and have R receive my coordinates whenever I press this:
(That’s an actual screen capture from the Apple Watch).
If you have both R and an Apple Watch, you can create a DO app as such (NOTE that DO apps work fine on the iPhone without an Apple Watch):
That will POST a bit of JSON to whatever is listening for it. To get R to listen to it, you need to build a small script that I’ll call listen.R, create a function to process the data and register it with the instance of the Rook web server that it will automagically create for you. You can specify which IP address and/or port to listen on as well. NOTE that this R server must be reachable from the internet, so you may need to do port forwarding if behind a NAT firewall.
My sample listen.R looks like this:
library(nifffty)
# not the best name for a function but it works
do_it <- function(req) {
  require(jsonlite)
  print(fromJSON(names(req$POST())))
  writeLines(names(req$POST()), "/tmp/bob.txt")
}
# you can change the port to match what you put into IFTTT
rcvr <- receiver(port=10999, handler=do_it)
# this is not necessary but it's a good diagnostic for a test script
print(rcvr)
# keeps the script alive
while (TRUE) Sys.sleep(24 * 60 * 60)
You can run that from a command line (I tested it on Ubuntu) as such:
bob@server:~$ Rscript listen.R
Loading required package: methods
Loading required package: Rook
Server started on 0.0.0.0:10999
[1] nifffty http://0.0.0.0:10999/custom/nifffty
Call browse() with an index number or name to run an application.
When you tap the DO button on the Apple Watch, you’ll see the following in the console:
Loading required package: jsonlite
Attaching package: ‘jsonlite’
The following object is masked from ‘package:utils’:
    View
$latitude
[1] 43.25931
$longitude
[1] -70.80062
$timestamp
[1] "June 19, 2015 at 04:45PM"
and the following in the file it was instructed to write to:
bob@server:~$ cat /tmp/bob.txt
{
 "latitude": 43.2593566552207,
 "longitude": -70.8004647307757,
 "timestamp": "June 19, 2015 at 04:46PM"
}
JSON is a good choice given how easy it is to work with in R and you can have R do anything you can dream up with the data that’s sent to it. You can connect any supported IFTTT component and hook any variable from said component into the JSON that R will receive.
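As an illustration of how little code that takes, here is a sketch that parses a payload shaped like the file above with jsonlite (the coordinate values are made up):

```r
library(jsonlite)

# A payload in the same shape as the Maker channel POST shown above
payload <- '{"latitude": 43.2593, "longitude": -70.8004,
             "timestamp": "June 19, 2015 at 04:46PM"}'

loc <- fromJSON(payload)  # a JSON object parses to a named list
cat("lat:", loc$latitude, "lon:", loc$longitude, "\n")
```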
Coming Soon
The next step is to make a Shiny interface so it can receive and process real-time events. I’m hoping to make a sample Shiny app that will update my location on a map every time I tap the DO button and also compute some stats for it. I also need to put some polish on the receiver interface (add logging and some error checking).
If you have cool ideas or have hooked R up to IFTTT’s Maker channel to do something cool, drop a note in the comments and definitely submit any pull requests or issues on github.
To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.
Segmentation of methylation profiles using methylKit
(This article was first published on Recipes, scripts and genomics, and kindly contributed to Rbloggers)
methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently included a new function in methylKit called methSeg(). The function can segment methylation profiles (methylRaw objects) or differential methylation profiles (methylDiff objects). Currently the function is available on a different branch for development. You can install the version of methylKit that has the segmentation functionality as shown below. We will move it to the master branch once we have more feedback for the function.
Rationale
Regulatory regions can be discovered by analyzing methylation patterns. One of the popular methods is to segment methylomes into distinct categories such as regions with low methylation or high methylation. The most popular method for categorical methylome segmentation is based on hidden Markov models (HMM) (an example is here). Another segmentation strategy is to use variants of change-point analysis, where change points in a signal across the genome are extracted and the genome is segmented into regions between two change points. These methods are typically used in CNV (copy-number variation) detection but have applications in the methylation context as well.
In the context of methylation, we first extract segments between change points of the signal (methylation profiles), then cluster those segments into groups based on the mean methylation values of the segments. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. The segment groups correspond to highly methylated regions or lowly methylated regions. Since this is an unsupervised method, we can find more patterns than just highly or lowly methylated regions. The users have to further overlap the segments with other marks and annotation such as promoters, H3K27ac or other histone modifications to associate segment groups with functional annotation.
Use case and example
Below, we are using methylation data from the Roadmap Epigenomics H1 cell line. The code demonstrates how to use the function and export the results to a BED file. Similarly, we can apply this function on differential methylation profiles, where values range between -100 and 100. Do help(methSeg) to see the help page for the function.
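Since the example code itself is not reproduced in this excerpt, here is a hedged sketch of what the workflow might look like; myobj stands in for a methylRaw object you have already loaded, and the exact arguments of methSeg()/methSeg2bed() should be checked against help(methSeg) on the development branch:

```r
library(methylKit)

# 'myobj' is assumed to be an existing methylRaw object (e.g. from methRead)
# Segment the profile into regions between change points and cluster the
# segments into groups by their mean methylation
res <- methSeg(myobj, diagnostic.plot = TRUE)

# Export the segments (with their segment-group labels) to a BED file
methSeg2bed(res, filename = "H1_meth_segments.bed")
```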
To leave a comment for the author, please follow the link and comment on his blog: Recipes, scripts and genomics.
R 3.2.1 is released
(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)
R 3.2.1 (codename “World-Famous Astronaut”) was released yesterday. You can get the latest binary version from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.
Upgrading to R 3.2.1 on Windows
If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:
install.packages("installr") # install
installr::updateR() # updating R
Running “updateR()” will detect if there is a new R version available, and if so it will download+install it (etc.). There is also a step by step tutorial (with screenshots) on how to upgrade R on Windows, using the installr package.
I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package – you are invited to open an issue in the github page.
CHANGES IN R 3.2.1:
NEW FEATURES
 utf8ToInt() now checks that its input is valid UTF8 and returns NA if it is not.
 install.packages() now allows type = "both" with repos = NULL if it can infer the type of file.
 nchar(x, *) and nzchar(x) gain a new argument keepNA which governs how the result for NAs in x is determined. For the R 3.2.x series, the default remains FALSE which is fully back compatible. From R 3.3.0, the default will change to keepNA = NA and you are advised to consider this for code portability.
 news() more flexibly extracts dates from package ‘NEWS.Rd’ files.
 lengths(x) now also works (trivially) for atomic x and hence can be used more generally as an efficient replacement of sapply(x, length) and similar.
 The included version of PCRE has been updated to 8.37, a bugfix release.
 diag() no longer duplicates a matrix when extracting its diagonal.
 as.character.srcref() gains an argument to allow characters corresponding to a range of source references to be extracted.
 acf() and ccf() now guarantee values strictly in [-1,1] (instead of sometimes very slightly outside). PR#15832.
 as.integer("111111111111") now gives NA (with a warning) as it does for the corresponding numeric or negative number coercions. Further, as.integer(M + 0.1) now gives M (instead of NA) when M is the maximal representable integer.
 On some platforms nchar(x, "c") and nchar(x, "w") would return values (possibly NA) for inputs which were declared to be UTF8 but were not, or for invalid strings without a marked encoding in a multibyte locale, rather than give an error. Additional checks have been added to mitigate this.
 apply(a, M, function(u) c(X = ., Y = .)) again has dimnames containing “X” and “Y” (as in R < 3.2.0).
 (Windows only) In some cases, the clean option to R CMD INSTALL could fail. (PR#16178)
 (Windows only) choose.files() would occasionally include characters from the result of an earlier call in the result of a later one. (PR#16270)
 A change in RSiteSearch() in R 3.2.0 caused it to submit invalid URLs. (PR#16329)
 Rscript and command line R silently ignored incomplete statements at the end of a script; now they are reported as parse errors. (PR#16350)
 Parse data for very long strings was not stored. (PR#16354)
 plotNode(), the workhorse of the plot method for "dendrogram"s is no longer recursive, thanks to Suharto Anggono, and hence also works for deeply nested dendrograms. (PR#15215)
 The parser could overflow internally when given numbers in scientific format with extremely large exponents. (PR#16358)
 If the CRAN mirror was not set, install.packages(type = "both") and related functions could repeatedly query the user for it. (Part of PR#16362)
 The lowlevel functions .rowSums() etc. did not check the length of their argument, so could segfault. (PR#16367)
 The quietly argument of library() is now correctly propagated from .getRequiredPackages2().
 Under some circumstances using the internal PCRE when building R from source would cause external libs such as -llzma to be omitted from the main link.
 The .Primitive default methods of the logic operators, i.e., !, & and |, now give correct error messages when appropriate, e.g., for `&`(TRUE) or `!`(). (PR#16385)
 cummax(x) now correctly propagates NAs also when x is of type integer and begins with an NA.
 summaryRprof() could fail when the profile contained only two records. (PR#16395)
 HTML vignettes opened using vignette() did not support links into the rest of the HTML help system. (Links worked properly when the vignette was opened using browseVignettes() or from within the help system.)
 arima(*, xreg = .) (for d >= 1) computes estimated variances based on the number of effective observations as in R version 3.0.1 and earlier. (PR#16278)
 slotNames(.) is now correct for "signature" objects (mostly used internally in methods).
 On some systems, the first string comparison after a locale change would result in NA.
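The lengths() change listed above is easy to try out; a quick sketch (requires R >= 3.2.x):

```r
# lengths() as an efficient replacement for sapply(x, length)
x <- list(a = 1:3, b = "one", c = NULL)
stopifnot(identical(lengths(x), c(a = 3L, b = 1L, c = 0L)))

# and now it also works (trivially) on atomic vectors
stopifnot(identical(lengths(1:4), rep(1L, 4)))
```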
To leave a comment for the author, please follow the link and comment on his blog: R-statistics blog » R.