R bloggers

R news and tutorials contributed by (573) R bloggers

URL Originality Analysis

Tue, 2015-06-23 11:11

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Campbell, Senior Consultant.

When I first heard about Twitter, it was described to me as love, life and the story of the world, summarized in 140 characters or less. Each snippet of information is like a beautiful haiku, perfectly capturing the essence of events. A moment captured in amber.

Of course, now I know what Twitter is: perhaps less idyllic, but still interesting. Chris Musselle (who wrote the first blog post in this series) had been investigating which data science tools people are talking about, and I was able to grab the data as a simple CSV.

A striking feature of the dataset that Chris captured using his Twitter-mining suite is that most tweets contain a URL; in fact, some tweets contain more than one. I wrote a simple function to list these and then looked at the table of counts: more than 19,000 URLs were tweeted across 22,500 tweets.
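The original function is not shown in the post, but a minimal sketch of the idea might look like this (the file and column names are assumptions, not taken from the article):

# Pull URLs out of the tweet text and count them per tweet.
tweets <- read.csv("tweets.csv", stringsAsFactors = FALSE)   # hypothetical file name
url_pattern <- "http[s]?://[^[:space:]]+"
urls <- regmatches(tweets$text, gregexpr(url_pattern, tweets$text))
n_urls <- lengths(urls)        # number of URLs in each tweet
table(n_urls)                  # how many tweets contain 0, 1, 2, ... URLs
length(unlist(urls))           # total number of URLs tweeted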

Rather than telling a story in 140 characters, 82% of tweets are cheating the character limit and using Twitter to advertise rich media. So if the story-telling isn’t happening on Twitter, where is it happening? And what stories are being told? Are users tweeting links that they’ve found on Twitter, or are they sharing links which they have discovered elsewhere?

To determine the originality of a URL, we can estimate the distribution of novel posts and shares using Twitter’s automatic URL shortening feature. Many URLs are rather long, depending on the website file structure, and could easily exceed 140 characters. To allow URLs to be shared, all URLs posted in Twitter are automatically converted to short form. For example,

http://www.r-bloggers.com/am-i-a-data-scientist/

was updated to

http://t.co/XQfmfy0wIR

The host http://t.co will re-route requests to address XQfmfy0wIR to the r-bloggers post. This then leaves space for user comment.

Of the 19,000 URLs:

  • 8,000 are unique short URLs (i.e. manual URL entry)
  • The remaining 11,000 are re-tweeted URLs
  • One short URL (http://projectsuperior.com) was tweeted over 200 times
  • Five short URLs were tweeted more than 50 times
  • The mean URL is shared 2.4 times
  • The median URL is shared once

 

To discover where these short URLs point, we need to decode each one to resolve its target destination. There are various tools in R for decoding short URLs; I used the decode_short_url function in the twitteR package. This function requests the URL from a web service and returns the long URL as a string. This can be slow for some sites and took about a second per URL on average. In addition, not all short URLs resolved on the first attempt, and some took several requests. About 80 short URLs could not be resolved at all, perhaps because the target site had moved or been deleted.
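In code, the resolution step might look something like this (a sketch; `short_urls` is assumed to be the character vector of unique t.co links, and failures are recorded as NA):

library(twitteR)
long_urls <- vapply(short_urls, function(u) {
  out <- tryCatch(decode_short_url(u), error = function(e) NULL)
  if (length(out) == 0) NA_character_ else as.character(out)[1]
}, character(1))
mean(is.na(long_urls))   # proportion of short URLs that could not be resolved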

I used the data.table package with the cSplit function from the splitstackshape package to reshape the dataset by URL rather than by tweet. I then merged the table of decoded unique short URLs with the reshaped dataset by recoding the short codes with the factor function.
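The reshaping is not shown in the post either; a rough sketch of the idea (the column names and the `lookup` table mapping short URLs to their decoded long form are my assumptions):

library(data.table)
library(splitstackshape)
# `tweets$urls` is assumed to hold each tweet's URLs as one comma-separated string
by_url <- cSplit(as.data.table(tweets), "urls", sep = ",", direction = "long")
# recode the short URLs to their long form, as described above
by_url[, long_url := factor(urls, levels = lookup$short, labels = lookup$long)]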

Of the 8,000 unique short URLs:

  • 6,400 are unique long URLs
  • The remaining 1,200 are URLs that were tweeted on more than one unique occasion
  • One URL (PayPal’s software blog) was tweeted on 65 unique occasions
  • Six URLs were tweeted on more than 15 unique occasions
  • The mean URL is tweeted on 1.2 occasions
  • The median URL is tweeted once

This originality analysis identified different types of popular URL communication on Twitter.

Some URLs were very re-tweetable, but not discoverable.

Some URLs were very discoverable.

  • The PayPal blog post was discovered 65 times and tweeted 173 times. A third of these discoveries were retweeted.
  • A presentation on net by Yuichi Ito was discovered 16 times and tweeted 33 times. A quarter of these discoveries were retweeted.
  • A thread in the German language Entwicklerforum was linked on 29 unique occasions. In this case, the thread was being promoted by a single user, Entwickler himself (herself?). None of these discoveries were retweeted.

Popularity by volume definitely does not tell the whole story about how users are interacting with a URL. And discoverability on its own is insufficient to demonstrate interest, as Entwickler would perhaps admit.

The high level view can take a little while to consume from tables of links. I used a combination of tweet volume (word size) and URL discoverability (opacity) to display truly interesting websites using the wordcloud package.
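One way such a display could be put together with the wordcloud package (a sketch of my own; the small data frame simply re-uses the three examples discussed above):

library(wordcloud)
site_stats <- data.frame(
  site     = c("PayPal blog", "Yuichi Ito presentation", "Entwicklerforum thread"),
  n_tweets = c(173, 33, 29),
  n_unique = c(65, 16, 29)
)
alpha <- with(site_stats, n_unique / max(n_unique))      # discoverability -> opacity
cols  <- rgb(0.1, 0.3, 0.6, alpha = pmax(alpha, 0.2))
wordcloud(words = site_stats$site, freq = site_stats$n_tweets,
          colors = cols, ordered.colors = TRUE, random.order = FALSE)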

This approach could be useful for prioritizing your lunchtime reading, and for separating the genuinely interesting from the spambots!

To read the first blog post in this series click here

To read the second blog post in this series click here

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.

Categories: Methodology Blogs

Illustrated Guide to ROC and AUC

Tue, 2015-06-23 07:49

(This article was first published on joy of data » R, and kindly contributed to R-bloggers)

(In a past job interview I failed at explaining how to calculate and interpret ROC curves – so here goes my attempt to fill this knowledge gap.) Think of a regression model mapping a number of features onto a real number (potentially a probability). The resulting real number can then be mapped onto one of two classes, depending on whether this predicted number is greater or lower than some choosable threshold. Let’s take, for example, a logistic regression and data on the survivorship of the Titanic accident to introduce the relevant concepts, which will lead naturally to the ROC (Receiver Operating Characteristic) and its AUC or AUROC (Area Under ROC Curve).

Titanic Data Set and the Logistic Regression Model

Every record in the data set represents a passenger – providing information on her/his age, gender, class, number of siblings/spouses aboard (sibsp), number of parents/children aboard (parch) and, of course, whether s/he survived the accident.

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/read_and_prepare_titanic_dataset.R
> df <- read_and_prepare_titanic_dataset("~/Downloads/titanic3.csv")
> str(df)
'data.frame': 1046 obs. of 6 variables:
 $ survived: Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
 $ pclass  : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num 29 0.92 2 30 25 48 63 39 53 71 ...
 $ sibsp   : int 0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int 0 2 2 2 2 0 0 0 0 0 ...

The logistic regression model is tested on batches of 10 cases, each batch being predicted by a model trained on the remaining N-10 cases – the test batches form a partition of the data. In short, leave-10-out CV has been applied to arrive at a more accurate estimation of the out-of-sample error rates.

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/log_reg.R
> predictions <- log_reg(df, size=10)
> str(predictions)
'data.frame': 1046 obs. of 2 variables:
 $ survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 2 1 2 ...
 $ pred    : num 0.114 0.854 0.176 0.117 0.524 ...
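The linked log_reg() helper is not reproduced in the post; a minimal sketch of the leave-10-out scheme described above (my assumption, not the original script) could be:

# Split the rows into batches of `size` and predict each batch with a model
# trained on all remaining rows.
log_reg_cv <- function(df, size = 10) {
  folds <- split(sample(nrow(df)), ceiling(seq_len(nrow(df)) / size))
  pred <- numeric(nrow(df))
  for (idx in folds) {
    fit <- glm(survived ~ pclass + sex + age + sibsp + parch,
               data = df[-idx, ], family = binomial)
    pred[idx] <- predict(fit, newdata = df[idx, ], type = "response")
  }
  data.frame(survived = df$survived, pred = pred)
}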

Distribution of the Predictions

Now let’s first have a look at the distribution of survival and death cases on the predicted survival probabilities.

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_pred_type_distribution.R
> plot_pred_type_distribution(predictions, 0.7)

If we consider survival as a positive (1) and death due to the accident as a negative (0) result, then the above plot illustrates the tradeoff we face upon choosing a reasonable threshold. If we increase the threshold the number of false positive (FP) results is lowered, while the number of false negative (FN) results increases.

Receiver Operating Characteristic

This question of how to balance false positives and false negatives (depending on the cost/consequences of either mistake) arose on a major scale during World War II, in the context of interpreting radar signals to identify enemy airplanes. For the purpose of visualizing and quantifying the impact of a threshold on the FP/FN tradeoff, the ROC curve was introduced. The ROC curve is the interpolated curve made of points whose coordinates are functions of the threshold:

FPR(threshold) = FP(threshold) / (FP(threshold) + TN(threshold))
TPR(threshold) = TP(threshold) / (TP(threshold) + FN(threshold))

In terms of hypothesis tests, where rejecting the null hypothesis is considered a positive result, the FPR (false positive rate) corresponds to the Type I error, the FNR (false negative rate) to the Type II error and (1 – FNR) to the power. So the ROC for the above distribution of predictions would be:

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/calculate_roc.R
roc <- calculate_roc(predictions, 1, 2, n = 100)

# https://github.com/joyofdata/joyofdata-articles/blob/master/roc-auc/plot_roc.R
plot_roc(roc, 0.7, 1, 2)

The dashed lines indicate the location of the (FPR, TPR) pair corresponding to a threshold of 0.7. Note that the lower-left corner (0,0) is associated with a threshold of 1 and the top-right corner (1,1) with a threshold of 0.

The cost function and the corresponding coloring of the ROC points illustrate that an optimal FPR and TPR combination is determined by the associated cost. Depending on the use case, false negatives might be more costly than false positives, or vice versa. Here I assumed a cost of 1 for FP cases and a cost of 2 for FN cases.
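The linked calculate_roc() script is not reproduced here either, but the quantities involved are straightforward; a sketch of my own, using the same FP cost of 1 and FN cost of 2:

calc_roc <- function(predictions, cost_fp = 1, cost_fn = 2, n = 100) {
  thresholds <- seq(0, 1, length.out = n)
  pos <- predictions$survived == "1"
  res <- t(sapply(thresholds, function(th) {
    pred_pos <- predictions$pred >= th
    c(threshold = th,
      fpr  = sum(pred_pos & !pos) / sum(!pos),
      tpr  = sum(pred_pos &  pos) / sum(pos),
      cost = sum(pred_pos & !pos) * cost_fp + sum(!pred_pos & pos) * cost_fn)
  }))
  as.data.frame(res)
}
roc <- calc_roc(predictions)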

Area Under (ROC) Curve

The optimal point on the ROC curve is (FPR, TPR) = (0,1). No false positives and all true positives. So the closer we get there the better. The second essential observation is that the curve is by definition monotonically increasing.

This monotonicity can easily be checked by looking at the first plot and mentally pushing the threshold (red line) up and down: lowering the threshold can only add positives (true or false), so FPR and TPR can only grow together. Furthermore, any reasonable model’s ROC is located above the identity line, as a point below it would imply a prediction performance worse than random (in that case, simply inverting the predicted classes would bring us to the sunny side of the ROC space).

All those features combined make it apparently reasonable to summarize the ROC into a single value by calculating the area of the convex shape below the ROC curve – this is the AUC. The closer the ROC gets to the optimal point of perfect prediction the closer the AUC gets to 1.

# AUC for the example
> library(pROC)
> auc(predictions$survived, predictions$pred)
Area under the curve: 0.8421

ROC and AUC for Comparison of Classifiers

Two main reasons make the ROC curve a potentially powerful metric for comparing different classifiers. One is that the resulting ROC is invariant against class skew of the applied data set – a data set featuring 60% positive labels will yield the same (statistically expected) ROC as a data set featuring 45% positive labels (though this will affect the cost associated with a given point of the ROC). The other is that the ROC is invariant against the evaluated score – which means that we could compare a model giving non-calibrated scores, like a regular linear regression, with a logistic regression or a random forest model whose scores can be considered class probabilities.

The AUC furthermore offers interesting interpretations:

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. [Fawcett]
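This rank interpretation is easy to check directly (a quick verification of my own, using the predictions from the example above):

pos <- predictions$pred[predictions$survived == "1"]
neg <- predictions$pred[predictions$survived == "0"]
# probability that a randomly chosen positive is scored above a randomly chosen
# negative (ties counted as 1/2) -- this reproduces the AUC
mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))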

 

[The AUC] also has various natural intuitive interpretations, one of which is that it is the average sensitivity of a classifier under the assumption that one is equally likely to choose any value of the specificity — under the assumption of a uniform distribution over specificity. [Hand]

As the ROC itself is variable with respect to a given data set it is necessary to average multiple ROCs derived from different data sets to arrive at a good estimation of a classifier’s true ROC function.






Criticism of the AUC

It seems problematic, in the first place, to measure and compare the performance of classifiers in absolute terms with something as simple as a scalar between 0 and 1. The fundamental reason is that problem-specific cost functions break the assumption that points in ROC space are homogeneous in that regard and therefore comparable across classifiers. This non-uniformity of the cost function causes ambiguities when the ROC curves of different classifiers cross, and again when the ROC curve is compressed into the AUC by integrating over the false positive rate.

However, the AUC also has a much more serious deficiency, and one which appears not to have been previously recognised. This is that it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules. It is equivalent to saying that, using one classifier, misclassifying a class 1 point is p times as serious as misclassifying a class 0 point, but, using another classifier, misclassifying a class 1 point is P times as serious, where p ≠ P. This is nonsensical because the relative severities of different kinds of misclassifications of individual points is a property of the problem, not the classifiers which happen to have been chosen. [Hand]

David J. Hand gives a statistically profound reasoning for the dubiousness of the AUC.

Sources

[Fawcett]: “An introduction to ROC analysis” by Tom Fawcett

[Hand]: “Measuring classifier performance: a coherent alternative to the area under the ROC curve” by David J. Hand

(original article published on www.joyofdata.de)

To leave a comment for the author, please follow the link and comment on his blog: joy of data » R.

Categories: Methodology Blogs

arXiv frenzy

Tue, 2015-06-23 07:35

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

In the past few days, there have been so many arXiv postings of interest—presumably the NIPS submission effect!—that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ‘Og in the near future:

  • arXiv:1506.06629: Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
  • arXiv:1506.06285: The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson
  • arXiv:1506.06268: Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
  • arXiv:1506.06117: Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
  • arXiv:1506.06101: Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
  • arXiv:1506.05934: Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
  • arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
  • arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
  • arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
  • arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
  • arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
  • arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
  • arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
  • arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
  • arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
  • arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]

Filed under: R, Statistics, University life Tagged: arXiv, Bayesian statistics, MCMC, Monte Carlo Statistical Methods, Montréal, NIPS 2015, particle filter

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

Categories: Methodology Blogs

The 2D-Harmonograph In Shiny

Tue, 2015-06-23 06:00

(This article was first published on Ripples, and kindly contributed to R-bloggers)

If you wish to make an apple pie from scratch, you must first invent the universe (Carl Sagan)

I like Shiny and I can’t stop converting some of my previous experiments into apps: today it is the turn of the harmonograph. This is a simple application, since you can only press a button to generate a random harmonograph-simulated curve. I love how easy it is to create a nice interactive app to play with from existing code. The only trick in this case is to add a rerun button on the UI side and transfer the interaction to the server side using a simple if. Very easy. This is a screenshot of the application:

Press the button and you will get a new drawing. Most of them are nice scrawls and from time to time you will obtain beautiful shapely curves.

And no more candy words: it is time to complain. I say to RStudio, with all due respect: you are very cruel. You let me deploy my previous app to your server but suspended it almost immediately for fifteen days due to “exceeded usage hours”. My only option is to pay at least $440 per year to upgrade my current plan. I tried the ambrosia for an extremely short time. RStudio: why don’t you launch a cheaper account? Why don’t you launch a free account with just one perpetually alive app at a time? Why don’t you increase the usage hours threshold? I can help you calculate the return on investment of these scenarios.

Or, Why don’t you make me a gift for my next birthday? I promise to upload a new app per month to promote your stunning tool. Think about it and please let me know your conclusions.

Meanwhile I will run the app privately. This is the code to do it:

UI.R

# This is the user-interface definition of a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://www.rstudio.com/shiny/

library(shiny)

shinyUI(fluidPage(
  titlePanel("Mathematical Beauties: The Harmonograph"),
  sidebarLayout(
    sidebarPanel(
      #helpText(),
      # adding the new div tag to the sidebar
      tags$div(class="header", checked=NA,
               tags$p("A harmonograph is a mechanical apparatus that employs pendulums to create a geometric image. The drawings created typically are Lissajous curves, or related drawings of greater complexity. The devices, which began to appear in the mid-19th century and peaked in popularity in the 1890s, cannot be conclusively attributed to a single person, although Hugh Blackburn, a professor of mathematics at the University of Glasgow, is commonly believed to be the official inventor. A simple, so-called 'lateral' harmonograph uses two pendulums to control the movement of a pen relative to a drawing surface. One pendulum moves the pen back and forth along one axis and the other pendulum moves the drawing surface back and forth along a perpendicular axis. By varying the frequency and phase of the pendulums relative to one another, different patterns are created. Even a simple harmonograph as described can create ellipses, spirals, figure eights and other Lissajous figures (Source: Wikipedia)")),
      tags$div(class="header", checked=NA,
               HTML("<p>Click <a href='http://paulbourke.net/geometry/harmonograph/harmonograph3.html'>here</a> to see an image of a real harmonograph</p>")),
      actionButton('rerun','Launch the harmonograph!')
    ),
    mainPanel(
      plotOutput("HarmPlot")
    )
  )
))

server.R

# This is the server logic for a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://www.rstudio.com/shiny/
#

library(shiny)

CreateDS = function () {
  f = jitter(sample(c(2,3), 4, replace = TRUE))
  d = runif(4, 0, 1e-02)
  p = runif(4, 0, pi)
  xt = function(t) exp(-d[1]*t)*sin(t*f[1]+p[1]) + exp(-d[2]*t)*sin(t*f[2]+p[2])
  yt = function(t) exp(-d[3]*t)*sin(t*f[3]+p[3]) + exp(-d[4]*t)*sin(t*f[4]+p[4])
  t = seq(1, 200, by=.0005)
  data.frame(t=t, x=xt(t), y=yt(t))
}

shinyServer(function(input, output) {
  dat <- reactive({if (input$rerun) dat=CreateDS() else dat=CreateDS()})
  output$HarmPlot <- renderPlot({
    plot(rnorm(1000), xlim=c(-2,2), ylim=c(-2,2), type="n")
    with(dat(), plot(x, y, type="l", xlim=c(-2,2), ylim=c(-2,2),
                     xlab="", ylab="", xaxt='n', yaxt='n', col="gray10", bty="n"))
  }, height = 650, width = 650)
})
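To run the app locally once both files are saved in the same directory (named, say, harmonograph – the folder name is just an assumption), something like this should do:

library(shiny)
runApp("harmonograph")   # point this at the directory containing ui.R and server.R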

To leave a comment for the author, please follow the link and comment on his blog: Ripples.

Categories: Methodology Blogs

Next Kölner R User Meeting: Friday, 26 June 2015

Tue, 2015-06-23 01:47

(This article was first published on mages' blog, and kindly contributed to R-bloggers)


The next Cologne R user group meeting is scheduled for this Friday, 26 June 2015, and we have an exciting agenda with two talks followed by networking drinks.

  • Data Science at the Commandline (Kirill Pomogajko)
  • An Introduction to RStan and the Stan Modelling Language (Paul Viefers)

Please note: Our venue has changed! We have outgrown the seminar room at the Institute of Sociology and are moving to Startplatz, a start-up incubator venue: Im Mediapark 5, 50670 Köln
Drinks and Networking

The event will be followed by drinks (Kölsch!) and networking opportunities.

For further details visit our KölnRUG Meetup site. Please sign up if you would like to come along. Notes from past meetings are available here.

The organisers, Bernd Weiß and Markus Gesmann, gratefully acknowledge the sponsorship of Revolution Analytics, who support the Cologne R user group as part of their Matrix programme.

This post was originally published on mages’ blog.

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

Categories: Methodology Blogs

Randomized Hobbit

Mon, 2015-06-22 18:20

(This article was first published on The stupidest thing... » R, and kindly contributed to R-bloggers)

@wrathematics pointed me to his ngram R package for constructing and simulating from n-grams from text.

I’d recently grabbed the text of The Hobbit, and so I applied the package to that text, with amusing results.

Here’s the code I used to grab the text.

library(XML)
stem <- "http://www.5novels.com/classics/u5688"
hobbit <- NULL
for(i in 1:74) {
    cat(i, "\n")
    if(i==1) {
        url <- paste0(stem, ".html")
    } else {
        url <- paste0(stem, "_", i, ".html")
    }
    x <- htmlTreeParse(url, useInternalNodes=TRUE)
    xx <- xpathApply(x, "//p", xmlValue)
    hobbit <- c(hobbit, gsub("\r", "", xx[-length(xx)]))
    Sys.sleep(0.5)
}

Then calculate the ngrams with n=2.

library(ngram)
ng2 <- ngram(hobbit, n=2)

Simulate some number of words with babble(). If you use the seed argument, you can get the result reproducibly.

babble(ng2, 48, seed=53482175)

into trees, and then bore to the Mountain to go through?” groaned the hobbit. “Well, are you doing, And where are you doing, And where are you?” it squeaked, as it was no answer. They were surly and angry and puzzled at finding them here in their holes

Update: @wrathematics suggested that I mix two texts, so here’s a bit from the Hobbit in the Hat (The Hobbit with 59× Cat in the Hat — up-sampled to match lengths.) But there’s maybe not enough overlap between the two texts to get much of a mixture.

“I am Gandalf,” said the fish. This is no way at all!

already off his horse and among the goblin and the dragon, who had remained behind to guard the door. “Something is outside!” Bilbo’s heart jumped into his boat on to sharp rocks below; but there was a good game, Said our fish No! No! Those Things should not fly.

To leave a comment for the author, please follow the link and comment on his blog: The stupidest thing... » R.

Categories: Methodology Blogs

Looking for Preference in All the Wrong Places: Neuroscience Suggests Choice Model Misspecification

Mon, 2015-06-22 13:03

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

At its core, choice modeling is a utility estimating machine. Everything has a value reflected in the price that we are willing to pay in order to obtain it. Here are a collection of Smart Watches from a search of Google Shopping. You are free to click on any one, look for more, or opt out altogether and buy nothing.

Where is the utility? It is in the brand name, the price, the user ratings and any other feature that gets noticed. If you pick only the Apple Smartwatch at its lowest price, I conclude that brand and price have high utility. It is a somewhat circular definition: I know that you value the Apple brand because you choose it, and you pick the Apple watch because the brand has value. We seem to be willing to live with such circularity as long as utility measured in one setting can be generalized over occasions and conditions. However, context matters when modeling human judgment and choice, making generalization a difficult endeavor. Utility theory is upended when higher prices alter perceptions so that the same food tastes better when it costs more.

What does any of this have to do with neuroscience? Utility theory was never about brain functioning. Glimcher and Fehr make this point in their brief history of neuroeconomics. Choice modeling is an “as if” theory claiming only that decision makers behave as if they assigned values to features and selected the option with the optimal feature mix.

When the choice task has been reduced to a set of feature comparisons as is the common practice in most choice modeling, the process seems to work at least in the laboratory (i.e., care must be taken to mimic the purchase process and adjustment may be needed when making predictions about real-world market shares). Yet, does this describe what one does when looking at the above product display from Google Shopping? I might try to compare the ingredients listed on the back of two packages while shopping in the supermarket. However, most of us find this task quickly becomes too difficult as the number of features exceeds our short-term memory limits (paying attention is costly).

Incorporating Preference Construction

Neuroeconomics suggests how value is constructed on-the-fly in real world choice tasks. Specifically, reinforcement learning is supported by multiple systems within the brain: “dual controllers” for both the association of features and rewards (model-free utilities) and the active evaluation of possible futures (model-based search). Daw, Niv and Dayan identified the two corresponding regions of the brain and summarized the supporting evidence back in 2005.

Features can become directly tied to value so that the reward is inferred immediately from the presence of the feature. Moreover, if we think of choice modeling only as the final stages when we are deciding among a small set of alternatives in a competitive consideration set, we might mistakenly conclude that utility maximization describes decision making. As in the movies, we may wish to “flashback” to the beginning of the purchase process to discover the path that ended at the choice point where features seem to dominate the representation.

Perception, action and utility are all entangled in the wild, as shown by the work of Gershman and Daw. Attention focuses on the most or least desirable features in the context of the goals we wish to achieve. We simulate the essentials of the consumption experience and ignore the rest. Retrospection is remembering the past, and prospection is experiencing the future. The steak garners greater utility sizzling than raw because it is easier to imagine the joy of eating it.

While the cognitive scientist wants to model the details of this process, the marketer will be satisfied learning enough to make the sale and keep the customer happy. In particular, marketing tries to learn what attracts attention, engages interest and consideration, generates desire and perceived need, and drives purchase while retaining customers (i.e., the AIDA model). These are the building blocks from which value is constructed.

Choice modeling, unfortunately, can identify the impact of features only within the confines of a single study, but it encounters difficulties attempting to generalize any effects beyond the data collected. Many of us are troubled that even relatively minor changes can alter the framing of the choice task or direct attention toward a previously unnoticed aspect (Attention and Reference Dependence).

The issue is not one of data collection or statistical analysis. The R package support.BWS will assist with the experimental design, and other R packages such as bayesm, RSGHB and ChoiceModelR will estimate the parameters of a hierarchical Bayes model. No, the difficulty stems from needing to present each respondent with multiple choice scenarios. Even if we limit the number of choice sets that any one individual will evaluate, we are still forced to simplify the task in order to show all the features for all the alternatives in the same choice set. In addition, multiple choice sets impose some demands for consistency so that a choice strategy that can be used over and over again without a lot of effort is preferred by respondents just to get through the questionnaire. On the other hand, costly information search is eliminated, and there is no anticipatory regret or rethinking one’s purchase since there is no actual transfer of money. In the end, our choice model is misspecified for two reasons: it does not include the variables that drive purchase in real markets and the reactive effects of the experimental arrangements create confounding effects that do not occur outside of the study.

Measuring the Forces Shaping Preference

Consumption is not random but structured by situational need and usage occasion. “How do you intend to use your Smartwatch?” is a good question to ask when you begin your shopping, although we will need to be specific because small differences in usage can make a large difference in what is purchased. To be clear, we are not looking for well-formed preferences, for instance, feature importance or contribution to purchase. Instead, we focus on attention, awareness and familiarity that might be precursors or early phases of preference formation. If you own an iPhone, you might never learn about Android Wear. What, if anything, can we learn from the apps on your Smartphone?

I have shown how the R package NMF for nonnegative matrix factorization can uncover these building blocks. We might wish to think of NMF as a form of collaborative filtering, not unlike a recommender system that partitions users into cliques or communities and products into genres or types (e.g., sci-fi enthusiasts and the fantasy/thrillers they watch). An individual pattern of awareness and familiarity is not very helpful unless it is shared by a larger community with similar needs arising from common situations. Product markets evolve over time by appealing to segments willing to pay more for differentiated offerings. In turn, the new offerings solidify customer differences. This separation of products and users into segregated communities forms the scaffolding used to construct preferences, and this is where we should begin our research and statistical analysis.
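As a toy illustration of that idea (synthetic data and parameter choices are my own, not the author's analyses), an NMF of a users-by-apps familiarity matrix splits it into user communities and app groupings:

library(NMF)
set.seed(1)
# build a 40-user x 12-app nonnegative matrix with 3 latent blocks
W0 <- matrix(rexp(40 * 3), 40, 3)
H0 <- matrix(rexp(3 * 12), 3, 12)
X  <- W0 %*% H0
fit <- nmf(X, rank = 3)
W <- basis(fit)   # user loadings on the 3 latent communities
H <- coef(fit)    # app loadings: which apps define each community
apply(H, 1, function(h) order(h, decreasing = TRUE)[1:3])   # top 3 apps per block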

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.

Categories: Methodology Blogs

DeployR Data I/O

Mon, 2015-06-22 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Sean Wells, Senior Software Engineer, Microsoft and David Russell

DeployR exists to solve a number of fundamental R analytics integration problems faced by application developers. For example, have you ever wondered how you might execute an R script from within a Web-based dashboard, an enterprise middleware solution, or a mobile application? DeployR makes it very simple. In fact, DeployR makes it very simple for any application developed in any language to:

  1. Provision one or more dedicated R sessions on demand
  2. Execute R scripts on those R sessions
  3. Pass inputs (data, files, etc.) when executing those R scripts
  4. Retrieve outputs (data, files, plots, etc.) following the execution of those R scripts

DeployR offers these features and many more through a set of Analytics Web services. These Web service interfaces isolate the application developer and the application itself from R code, from the underlying R sessions, in fact from all the complexities typically associated with R integration. DeployR client libraries currently available in Java, JavaScript and .NET make integration for application developers a snap. As application developers, we can leave R coding and model building to the data scientists. Now we can also leave all aspects of R session management to the DeployR server. This frees us up to focus on simple integrations with DeployR services that deliver the phenomenal power of R directly within our applications.

But what we want to focus on in this post is the rich set of data inputs and data outputs that can be passed and retrieved when executing R scripts on DeployR. And rather than tell you how you can work with inputs and outputs, we think it's better to show you, which is why we have prepared extensive tutorials on DeployR Data I/O, with full source code and run instructions available on GitHub.

Taking just one example application from these tutorials, RepoFileInEncodedDataOut, let's briefly explore what's going on. The application generates the following console output:

CONFIGURATION  Using endpoint http://localhost:7400/deployr
CONFIGURATION  Using broker config [ PooledBrokerConfig ]
CONNECTION     Established authenticated pooled broker [ RBroker ]
TASK INPUT     Repository binary file input set on task, [ PooledTaskOptions.preloadWorkspace ]
TASK OPTION    DeployR-encoded R object request set on task [ PooledTaskOptions.routputs ]
EXECUTION      Pooled task submitted to broker [ RTask ]
TASK RESULT    Pooled task completed in 1833ms [ RTaskResult ]
TASK OUTPUT    Retrieved DeployR-encoded R object output hip [ RDataFrame ]
TASK OUTPUT    Retrieved DeployR-encoded R object output hipDim [ RNumericVector ]
TASK OUTPUT    Retrieved DeployR-encoded R object hipDim value=[2719.0, 9.0]
TASK OUTPUT    Retrieved DeployR-encoded R object output hipNames [ RStringVector ]
TASK OUTPUT    Retrieved DeployR-encoded R object hipNames value=[HIP, Vmag, RA, DE, Plx, pmRA, pmDE, e_Plx, B.V]

This console output can be interpreted as follows:

  1. CONFIGURATION & CONNECTION – the application connects and provisions R sessions on DeployR
  2. TASK INPUT – the application identifies a repository-managed file for auto-loading prior to task execution
  3. TASK OPTION – the application identifies the set of R workspace objects to be returned following task execution
  4. EXECUTION – the application executes the task
  5. TASK OUTPUT – the application retrieves the set of R workspace objects returned following task execution

Take a look at the Java source code for this particular example application here. As with all examples, the source code is heavily documented to help you understand the implementation step-by-step.

These tutorials are currently written in Java but the capabilities and mechanisms demonstrated apply equally across all languages and platforms. Check out the tutorials and let us know what you think.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

Categories: Methodology Blogs

Editing Dockerfiles with Architect

Mon, 2015-06-22 08:36

(This article was first published on Open Analytics - Blog, and kindly contributed to R-bloggers)

Monday 22 June 2015 – 14:36

In the last two years, software development has taken a step forward with the advent of Docker. For the uninitiated, Docker is a tool that automates the deployment of applications by packaging them with their dependencies in a virtual container, eliminating the need for a full virtual machine.

Docker has streamlined the process of application development on Linux servers, and Open Analytics has streamlined the creation of Docker files with Architect’s new Dockerfile Editor.

Installation

To install the Dockerfile Editor, first click Help -> Install New Software

Next, add the URL http://docker-editor.openanalytics.eu/update as a new repository.

Finally, select the “Dockerfile Editor” and click next. NOTE: You must first de-select the “Group items by category” option.

Using the Dockerfile Editor

For basic instructions on using the Dockerfile Editor, please visit our guide, Getting Started with Dockerfile Editor.


To leave a comment for the author, please follow the link and comment on his blog: Open Analytics - Blog.

Categories: Methodology Blogs

New Shiny cheat sheet and video tutorial

Mon, 2015-06-22 08:28

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’ve added two new tools that make it even easier to learn Shiny.

Video tutorial

The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:

  • The architecture of a Shiny app
  • A template for making apps quickly
  • The basics of building Shiny apps
  • How to add sliders, drop down menus, buttons, and more to your apps
  • How to share Shiny apps
  • How to control reactions in your apps to
    • update displays
    • trigger code
    • reduce computation
    • delay reactions
  • How to add design elements to your apps
  • How to customize the layout of an app
  • How to style your apps with CSS

Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.

Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!

Watch the new video tutorial here.

New cheat sheet

The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.

The cheat sheet replaces the previous cheat sheet, adding new sections on single-file apps, reactivity, CSS and more. The new sheet also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.

Get the new Shiny cheat sheet here.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.

Categories: Methodology Blogs

Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years

Mon, 2015-06-22 01:37

(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my Github repository.

Here are some preliminary descriptive statistics:

The data shows some interesting trends:

  • Stops had been increasing steadily from 2003 to 2012, but falling since 2012.
  • The percentage of stopped persons who were black was consistently 3.5-6.5 times higher than the percentage of stopped persons who were white.
  • The data indicates whether or not officers explained the reason for the stop to the stopped person. Police gave an explanation about 98-99% of the time. Of course, this involves a certain level of trust, since the data itself is recorded by police. There is no difference in this statistic across race and sex.
  • The median age of stopped persons was 24. The distribution was roughly the same across race and sex.

A few notes on the data:

  • The raw data is saved as CSV files, one file for each year. However, the same variables are not tracked in each year. The .RData file on Github only contains select variables.
  • The importing and cleaning codes can take about 15 minutes to run.
  • All stops in all years have coordinates marking the location of the stop, however I’m still unable to make sense of them. I plan to publish another post with some spatial analyses.

The coding for this was particularly interesting because I had never used R to download ZIP files from the web. I reproduced this portion of the code below; it produces one dataset for each of the years 2013 and 2014.

for(i in 2013:2014){
  # download the year's ZIP archive to a temporary file
  temp <- tempfile()
  url <- paste("http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/",
               i, "_sqf_csv.zip", sep='')
  download.file(url, temp)
  # read the CSV inside the ZIP and store it as d2013 / d2014
  assign(paste("d", i, sep=''), read.csv(unz(temp, paste(i, ".csv", sep=''))))
}
unlink(temp)

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.

Categories: Methodology Blogs

12 new R jobs (2015-06-22) – from @California to @Nottingham

Mon, 2015-06-22 01:21

This is the bimonthly post (for 2015-06-22) for new R Jobs from R-users.com.

Employers: visit this link to post a new R job to the R community (it’s free and quick).

Job seekers: please follow the links below to learn more and apply for your job of interest (or visit previous R jobs posts).

  1. Full-Time
    Data Scientist (@Nottingham)
    Capital One – Posted by S.Kugadasan
    Nottingham
    England, United Kingdom
    19 Jun 2015
  2. Full-Time
    Lead Data Scientist (@Nottingham)
    Capital One – Posted by S.Kugadasan
    Nottingham
    England, United Kingdom
    19 Jun 2015
  3. Full-Time
    Instructional Technologist (Quantitative Applications) @ReedCollege @Portland
    Reed College – Posted by KBott
    Portland
    Oregon, United States
    19 Jun 2015
  4. Part-Time
    Data Scientist Intern at eBay (@Netanya)
    ebay Marketplace Israel – Posted by ebay
    Netanya
    Center District, Israel
    18 Jun 2015
  5. Full-Time
    Statistician Intermediate at Rice University (@Houston)
    Children’s Environmental Health Initiative – Posted by kafi@umich.edu
    Houston
    Texas, United States
    18 Jun 2015
  6. Full-Time
    Data Scientist (@TelAviv)
    ubimo – Posted by Tal
    Tel Aviv-Yafo
    Tel Aviv District, Israel
    18 Jun 2015
  7. Full-Time
    Data Scientist/Statistician (@KfarSaba)
    Haimke
    Kefar Sava
    Center District, Israel
    17 Jun 2015
  8. Full-Time
    R Programmer / Data Analyst (@SanDiego)
    University of California San Diego Dept. of Radiation Medicine – Posted by lmell
    San Diego
    California, United States
    16 Jun 2015
  9. Full-Time
    Data Scientist (@Prague)
    CGI – Posted by CGI
    Prague
    Hlavní město Praha, Czech Republic
    16 Jun 2015
  10. Full-Time
    R Programmer (@Connecticut)
    Paul Dai – Real Staffing – Posted by Paul Dai
    Ridgefield
    Connecticut, United States
    11 Jun 2015
  11. Full-Time
    Data Scientist (@Holon)
    Yoni Schamroth / Perion Networks – Posted by Yoni Schamroth
    Holon
    Center District, Israel
    11 Jun 2015
  12. Full-Time
    Quantitative Research Assistant for predicting the effects of public policy
    American Enterprise Institute – Posted by clairegaut
    Anywhere
    10 Jun 2015

 

 

Categories: Methodology Blogs

Trading Moving Averages with Less Whipsaws

Sun, 2015-06-21 23:35

(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)

Using a simple moving average to time markets has been a successful strategy over a very long period of time. Nothing to write home about, but it cuts the drawdown of a buy and hold roughly in half, sacrificing less than 1% of the CAGR in the process. In two words: simple yet effective. Here are the important numbers (using the S&P 500 index from 1994 to 2013, inclusive):

Statistic              SMA 200    Buy and Hold
CAGR                   6.51%      7.13%
Sharpe Ratio           0.56       0.37
Sortino Ratio          0.05       0.04
Pct in Market          70.14      99.94
Winning Pct            54.61%     53.84%
Avg Drawdown           2.13%      2.48%
Avg Annual Drawdown    9.12%      15.83%
Max Drawdown           28.28%     56.78%
Gain to Pain           0.71       0.45

The moving average system is better by far. First, we can leverage it 2:1 (thus bringing the CAGR up to approximately 13%) and still our maximum drawdown is going to be comparable to that of buy and hold. Furthermore, it is in the market only 70% of the time – less risk, and the returns can probably be boosted further by putting the money to work in alternative assets.

One problem with this type of system is whipsaws. The system works well when the price stays away from the moving average. However, in close proximity, one may have to enter and exit in quick succession, losing money on the way.

One way to address this issue is to use an alternative “line” to trigger the exits (or the entries, for that matter). It could be a percentage band, but that’s hardly universal. Better to bring volatility into the picture.

Let’s do something more robust as an illustration.

require(quantmod)
require(btutils)   # For construct.indicator

# Download data
gspc = getSymbols("^GSPC", from="1900-01-01", auto.assign=F)
# Compute the returns
rets = ROC(Cl(gspc), type="discrete")
# Compute the adjusted returns
adj.rets = sqrt(runSum(rets*rets, 10)/9)
# The moving average
sma = runMean(adj.rets, n=200)
# The standard deviation
stddev = sqrt(runSum(adj.rets*adj.rets, 200)/199)
# Long trades are entered when the average turns positive
upper.band = 0.0
# Long trades are closed when the average return goes below -0.5 standard deviations
lower.band = -0.05*stddev
# For the "uncushioned" version use
# lower.band = 0
uu = ifelse(sma > upper.band, 1, 0)
dd = ifelse(sma < lower.band, -1, 0)
long.entries = (uu == 1 & lag(uu) == 0)
long.exits = (dd == -1 & lag(dd) == 0)
short.entries = long.entries
short.exits = long.entries
short.entries[] = FALSE
short.exits[] = FALSE
# Given entries and exits (both for long and short positions), this function builds
# the corresponding indicator. It's pure C++, so it's fast too.
ind = btutils::construct.indicator(long.entries, long.exits, short.entries, short.exits)

First we apply the moving average not to the prices, but to the returns (an interpretation of David Varadi’s Error Adjusted Momentum). The “cushioned” system goes long when the average becomes positive, but to exit, it leaves some cushion below the moving average. What do we gain?

Statistic              Cushioned EA    EA        SMA 200    Buy and Hold
CAGR                   10.93%          9.11%     6.51%      7.13%
Sharpe Ratio           0.80            0.69      0.56       0.37
Sortino Ratio          0.07            0.06      0.05       0.04
Pct in Market          80.28%          76.29%    70.14%     99.94%
Winning Pct            55.21%          55.00%    54.61%     53.84%
Avg Drawdown           1.88%           2.05%     2.13%      2.48%
Avg Annual Drawdown    8.90%           9.26%     9.12%      15.83%
Max Drawdown           19.39%          19.57%    28.28%     56.78%
Gain to Pain           1.23            0.98      0.71       0.45

To compare apples to apples, we added two systems. The first one (dubbed EA) uses error adjusted returns to compute the SMA, but enters and exits when the SMA crosses the 0 line. The “cushioned” version is the system implemented by the above code.

Even after taking into account that the new strategy stays longer in the market, there seems to be a slight improvement. But that’s not the point. The “cushioned” approach did only four trades in total – it exited for the two bear markets. The non-cushioned approach had 80 trades. Mission accomplished, in other words.
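For completeness, here is one way summary statistics like those in the tables above could be computed once the indicator is available (my own sketch, not the author's code; it assumes `ind` is an xts series aligned with `rets`):

library(PerformanceAnalytics)
pos <- lag(ind, 1)                      # trade on the next bar to avoid look-ahead
strat.rets <- pos * rets
strat.rets[is.na(strat.rets)] <- 0
perf <- merge(strat.rets, rets)
colnames(perf) <- c("Cushioned EA", "Buy and Hold")
Return.annualized(perf)                 # CAGR
SharpeRatio.annualized(perf)
maxDrawdown(perf)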

The post Trading Moving Averages with Less Whipsaws appeared first on Quintuitive.

To leave a comment for the author, please follow the link and comment on his blog: Quintuitive » R.

Categories: Methodology Blogs

Organize a walk around London with R

Sun, 2015-06-21 06:25

(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

The subtitle of this post can be “How to plot multiple elements on interactive web maps in R“.
In this experiment I will show how to include multiple elements in interactive maps created using both plotGoogleMaps and leafletR. To complete the work presented here you would need the following packages: sp, raster, plotGoogleMaps and leafletR.

I am going to use data from the OpenStreet maps, which can be downloaded for free from this website: weogeo.com
In particular I downloaded the shapefile with the stores, the one with the tourist attractions and the polyline shapefile with all the roads in London. I will assume that you want to spend a day or two walking around London, and for this you would need the location of some hotels and the locations of all the Greggs in the area, for lunch. In other words, you need a web map with all these customized elements that you can take with you as you walk around the city; here is how to create it.

Once you have downloaded the shapefiles from weogeo.com you can open them and assign the correct projection with the following code:

library(sp)
library(raster)

stores <- shapefile("weogeo_j117529/data/shop_point.shp")
projection(stores)=CRS("+init=epsg:3857")

roads <- shapefile("weogeo_j117529/data/route_line.shp")
projection(roads)=CRS("+init=epsg:3857")

tourism <- shapefile("weogeo_j117529/data/tourism_point.shp")
projection(tourism)=CRS("+init=epsg:3857")

To extract only the data we would need to the map we can use these lines:

Greggs <- stores[stores$NAME %in% c("Gregg's","greggs","Greggs"),]
 
Hotel <- tourism[tourism$TOURISM=="hotel",]
Hotel <- Hotel[sample(1:nrow(Hotel),10),]
 
 
Footpaths <- roads[roads$ROUTE=="foot",]

plotGoogleMaps
I created three objects: two are points (Greggs and Hotel) and the third is of class SpatialLinesDataFrame. We already saw how to plot Spatial objects with plotGoogleMaps; here the only difference is that we need to create several maps and then link them together.
Let’s take a look at the following code:

library(plotGoogleMaps)

Greggs.google <- plotGoogleMaps(Greggs,
    iconMarker=rep("http://local-insiders.com/wp-content/themes/localinsiders/includes/img/tag_icon_food.png", nrow(Greggs)),
    mapTypeId="ROADMAP", add=T, flat=T, legend=F, layerName="Gregg's", fitBounds=F, zoom=13)

Hotel.google <- plotGoogleMaps(Hotel,
    iconMarker=rep("http://www.linguistics.ucsb.edu/projects/weal/images/hotel.png", nrow(Hotel)),
    mapTypeId="ROADMAP", add=T, flat=T, legend=F, layerName="Hotels", previousMap=Greggs.google)

plotGoogleMaps(Footpaths, col="dark green", mapTypeId="ROADMAP",
    filename="Multiple_Objects_GoogleMaps.html", legend=F, previousMap=Hotel.google,
    layerName="Footpaths", strokeWeight=2)

As you can see I first create two objects using the same function and then I call again the same function to draw and save the map. I can link the three maps together using the option add=T and previousMap.
We need to be careful here though, because the use of the option add is different from the standard plot function. In plot I can call the first and then if I want to add a second I call again the function with the option add=T. Here this option needs to go in the first and second calls, not in the last. Basically in this case we are saying to R not to close the plot because later on we are going to add elements to it. In the last line we do not put add=T, thus saying to R to go ahead and close the plot.

Another important option is previousMap, which is used starting from the second plot to link the various elements. This option references the previous object, meaning that I reference the map in Hotel.google to the map in Greggs.google, while in the last call I reference it to the previous Hotel.google, not the very first.

The zoom level, if you want to set it, goes only in the first plot.

Another thing I changed compared to the last example is the addition of custom icons to the plot, using the option iconMarker. This takes a vector of icons, not just one, with the same length as the Spatial object to be plotted. That is why I use the function rep, to create a vector with the same URL repeated a number of times equal to the length of the object.
The icon can be whatever image you like. You can find a collection of free icons from this website: http://kml4earth.appspot.com/icons.html

The result is the map below, available here: Multiple_Objects_GoogleMaps.html

leafletR
We can do the same thing using leafletR. We first need to create GeoJSON files for each element of the map using the following lines:

library(leafletR)

Greggs.geojson <- toGeoJSON(Greggs)
Hotel.geojson <- toGeoJSON(Hotel)
Footpaths.geojson <- toGeoJSON(Footpaths)

Now we need to set the style for each element. For this task we are going to use the function styleSingle, which basically defines a single style for all the elements of the GeoJSON. This differs from the map in a previous post, in which we used the function styleGrad to create graduated colors depending on certain features of the dataset.
We can change the icons of the elements in leafletR using the following code:
We can change the icons of the elements in leafletR using the following code:

Greggs.style <- styleSingle(marker=c("fast-food", "red", "s"))
Hotel.style <- styleSingle(marker=c("lodging", "blue", "s"))
Footpaths.style <- styleSingle(col="darkred",lwd=4)

As you can see, the option marker takes a vector with the name of the icon, its colour and its size ("s" for small, "m" for medium or "l" for large). The icon names can be found here: https://www.mapbox.com/maki/; if you hover the mouse over an icon, the last piece of information shown is the name to use here. The style of the lines is set with the two options col and lwd, for line colour and line width.

Then we can simply use the function leaflet to set the various elements and styles of the map:

leaflet(c(Greggs.geojson, Hotel.geojson, Footpaths.geojson),
    style=list(Greggs.style, Hotel.style, Footpaths.style),
    popup=list(c("NAME"), c("NAME"), c("OPERATOR")),
    base.map="osm")

The result is the image below and the map available here: http://www.fabioveronesi.net/Blog/map.html


To leave a comment for the author, please follow the link and comment on his blog: R tutorial for Spatial Statistics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

SAS PROC MCMC example 12 in R: Change point model

Sun, 2015-06-21 04:14

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I restarted working my way through the PROC MCMC examples. The SAS manual describes this example: consider the data set from Bacon and Watts (1971), where y is the logarithm of the height of the stagnant surface layer and the covariate x is the logarithm of the flow rate of water.
It is a simple example and it posed no problems at all for STAN and JAGS. LaplacesDemon, on the other hand, gave me some trouble: it took quite some effort to obtain samples which seemed to be behaving. I did not try to do this in MCMCpack, but noted that the function MCMCregressChange() uses a slightly different model. The section below shows the results first; the code is given at the bottom.
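For orientation, the mean fitted in the code at the bottom of the post is a two-phase (broken-stick) linear regression whose two segments join at the change point cp. A minimal sketch of that mean function, using the same parameterisation as that code and, purely for illustration, values close to the posterior means reported below:

# Piecewise mean of the change point model: below cp the slope is
# beta[1], at or above cp it is beta[2]; the segments meet at (cp, alpha).
piecewise_mean <- function(x, alpha, beta, cp) {
  alpha + (x - cp) * beta[1 + (x >= cp)]
}

# Quick visual check with illustrative values close to the posterior means
curve(piecewise_mean(x, alpha = 0.53, beta = c(-0.42, -1.01), cp = 0.03),
      from = -1.4, to = 1.2, xlab = "x", ylab = "fitted y")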

Previous posts in the series of PROC MCMC examples programmed in R were: example 61.1: sampling from a known density; example 61.2: Box Cox transformation; example 61.5: Poisson regression; example 61.6: Non-Linear Poisson Regression; example 61.7: Logistic Regression Random-Effects Model; and example 61.8: Nonlinear Poisson Regression Multilevel Random-Effects Model.
Data

Data are read as below.
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)
LaplacesDemon

I have been playing around with LaplacesDemon. There is actually a function LaplacesDemon.hpc which can use multiple cores. However, on my computer it seems more efficient just to use mclapply() from the parallel package and give the combined result the class demonoid.hpc, which is what LaplacesDemon.hpc produces (see the sketch below; the full code is at the bottom of the post). Having said that, I again had quite some trouble getting LaplacesDemon to work well. In the end I used a combination of two calls to LaplacesDemon. The plot further down shows selected samples after the first run. Not good enough yet, but I do like this way of displaying the results of the chains. The labels looked correct for all parameters, but showing them all gave too much output for this blog. After the second call the results looked acceptable.
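A condensed version of that parallel set-up (Model and MyData are defined in the full code at the bottom of the post):

library(LaplacesDemon)
library(parallel)

# Four independent random-walk Metropolis chains run in parallel
Fit1 <- mclapply(1:4, function(i)
    LaplacesDemon(Model, Data = MyData,
        Iterations = 100000, Algorithm = "RWM",
        Covar = c(rep(.01, 4), .00001),
        Initial.Values = c(.5, -.4, -1, .05, .001)))

# Treat the list of chains as if it had come from LaplacesDemon.hpc
class(Fit1) <- "demonoid.hpc"
cc1 <- Combine(Fit1, MyData)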


Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = apply(cc1$Posterior1,
    2, median), Covar = var(cc1$Posterior1), Iterations = 1e+05,
    Algorithm = "RWM")

Acceptance Rate: 0.2408
Algorithm: Random-Walk Metropolis
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
       alpha        beta1        beta2           cp           s2
4.920676e-04 2.199525e-04 3.753738e-04 8.680339e-04 6.122862e-08

Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
          All Stationary
Dbar -144.660   -144.660
pD      7.174      7.174
DIC  -137.486   -137.486
Initial Values:
        alpha         beta1         beta2            cp            s2
 0.5467048515 -0.4100040451 -1.0194586232  0.0166405998  0.0004800931

Iterations: 4e+05
Log(Marginal Likelihood): 56.92606
Minutes of run-time: 1.32
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 5
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 0
Recommended Burn-In of Un-thinned Samples: 0
Recommended Thinning: 5
Specs: (NOT SHOWN HERE)
Status is displayed every 100 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 40000
Thinning: 1

Summary of All Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01

Summary of Stationary Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01
STAN

Stan gave no real problems for this analysis.
Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Beta[1] -0.42    0.00 0.01 -0.45 -0.43 -0.42 -0.41 -0.39  1017 1.00
Beta[2] -1.01    0.00 0.02 -1.05 -1.02 -1.01 -1.00 -0.98  1032 1.00
Alpha    0.54    0.00 0.03  0.49  0.52  0.53  0.55  0.59   680 1.00
s2       0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  1361 1.00
cp       0.03    0.00 0.03 -0.04  0.00  0.03  0.05  0.09   636 1.00
lp__    90.63    0.06 1.78 86.17 89.71 91.00 91.91 93.03   935 1.01

Samples were drawn using NUTS(diag_e) at Fri Jun 19 21:17:54 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
JAGS

Again no problems for Jags.
Inference for Bugs model at “/tmp/Rtmpy4a6C5/modeld4f6e9c9055.txt”, fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 4000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
alpha       0.534   0.027    0.479    0.518    0.533    0.552    0.586 1.040
beta[1]    -0.420   0.015   -0.450   -0.429   -0.420   -0.410   -0.389 1.013
beta[2]    -1.014   0.017   -1.049   -1.024   -1.014   -1.003   -0.980 1.023
cp          0.029   0.035   -0.038    0.006    0.032    0.051    0.100 1.037
s2          0.000   0.000    0.000    0.000    0.000    0.001    0.001 1.004
deviance -144.501   3.986 -149.634 -147.378 -145.432 -142.584 -134.378 1.021
         n.eff
alpha      160
beta[1]    380
beta[2]    290
cp         160
s2         710
deviance   290

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 7.9 and DIC = -136.6
DIC is an estimate of expected predictive error (lower deviance is better).
CODE used

# http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_mcmc_examples18.htm
# Example 61.12 Change Point Models
##############
#Data                                                   
##############
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)
#plot(y~x,data=stagnant)
##############
#LaplacesDemon                                                   
##############
library('LaplacesDemon')
library(parallel)

mon.names <- "LP"
parm.names <- c('alpha',paste('beta',1:2,sep=''),'cp','s2')

PGF <- function(Data) {
  x <-c(rnorm(5,0,1))
  x[4] <- runif(1,-1.3,1.1)
  x[5] <- runif(1,0,2)
  x
}
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    PGF=PGF,
    x=stagnant$x,
    y=stagnant$y)
#N<-1
Model <- function(parm, Data)
{
  alpha=parm[1]
  beta=parm[2:3]
  cp=parm[4]
  s2=parm[5]
  yhat <- alpha+(Data$x-cp)*beta[1+(Data$x>=cp)]
  LL <- sum(dnorm(Data$y,yhat,sd=sqrt(s2),log=TRUE))
  prior <- sum(dnorm(parm[1:3],0,1e3,log=TRUE))+
      dunif(cp,-1.3,1.1,log=TRUE)+
      dunif(s2,0,5,log=TRUE)
  LP=LL+prior
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
      yhat=yhat,
      parm=parm)
  return(Modelout)
}
Fit1 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
    Data=MyData,
    Iterations=100000,
    Algorithm='RWM',
    Covar=c(rep(.01,4),.00001),
    Initial.Values = c(.5,-.4,-1,.05,.001)) #Initial.Values
)
class(Fit1) <- 'demonoid.hpc'
plot(Fit1,Data=MyData,Parms=c('alpha'))
cc1 <- Combine(Fit1,MyData)
#
Fit2 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
          Data=MyData,
          Iterations=100000,
          Algorithm='RWM',
          Covar=var(cc1$Posterior1),
          Initial.Values = apply(cc1$Posterior1,2,median)) #Initial.Values
)
class(Fit2) <- 'demonoid.hpc'
#plot(Fit2,Data=MyData,Parms=c('alpha'))
cc2 <- Combine(Fit2,MyData)
cc2
##############
#STAN                                                   
##############
stanData <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

library(rstan)
smodel <- '
    data {
    int <lower=1> N;
    vector[N] x;
    vector[N] y;
    }
    parameters {
    real Beta[2];
    real Alpha;
    real <lower=0> s2;
    real <lower=-1.3,upper=1.1> cp;
    }
    transformed parameters {
    vector[N] yhat;
    for (i in 1:N)
       yhat[i] <- Alpha+(x[i]-cp)*Beta[1+(x[i]>cp)];
    }
    model {
    y ~ normal(yhat,sqrt(s2));
    s2 ~ uniform(0,1e3);
    cp ~ uniform(-1.3,1.1);
    Alpha ~ normal(0,1000);
    Beta ~ normal(0,1000);
    }
    '
fstan <- stan(model_code = smodel,
    data = stanData,
    pars=c('Beta','Alpha','s2','cp'))
fstan
##############
#Jags                                                  
##############
library(R2jags)
jagsdata <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

jagsm <- function() {
  for(i in 1:N) {
    yhat[i] <- alpha+(x[i]-cp)*beta[1+(x[i]>cp)]
    y[i] ~ dnorm(yhat[i],tau)
  }
  tau <- 1/s2
  s2 ~  dunif(0,1e3)
  cp ~ dunif(-1.3,1.1)
  alpha ~ dnorm(0,0.001)
  beta[1] ~ dnorm(0,0.001)
  beta[2] ~ dnorm(0,0.001)
}
params <- c('alpha','beta','s2','cp')
myj <- jags(model=jagsm,
    data = jagsdata,
    n.chains=4,
    n.iter=10000,
    parameters=params)
myj

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

dimple charts for R

Sat, 2015-06-20 19:20

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

dimple is a simple-to-use charting API powered by D3.js.

Making use of the nice htmlwidgets package, it took only a minimal amount of coding to make the dimple library available from R.

You can find the dimple R package at github.com/Bart6114/dimple and some documentation and examples at: bart6114.github.io/dimple (can take a while to load). Using the package you can create static javascript-based graphs, but you can also use dimple charts in Shiny applications.

To leave a comment for the author, please follow the link and comment on his blog: FishyOperations. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Interactive R Notebooks with Jupyter and SageMathCloud

Sat, 2015-06-20 19:00

(This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers)

The Jupyter Notebook interface allows users to seamlessly mix code, output, and markdown commentary all in the same document.  Sound a bit like R Markdown? The difference is that in a Jupyter Notebook, you execute the code right in the browser and view everything in the same visual view.

Jupyter, formerly known as the IPython Notebook, has a rich ecosystem, especially among Python users. Science News recently featured Jupyter and its benefits for reproducible research and collaboration; Github, which automagically renders Jupyter notebook files on their website, has a gallery featuring a wide variety of cool static renderings of notebooks.

The name Jupyter (JUlia, PYThon, R) shows that the project has ambitions outside of its Python homeland. But so far, I’ve seen very little discussion in the R community, besides Tony Hirst’s recent musings on synergies between R Markdown and Jupyter. R basically works in Jupyter, though it has nothing like the support that knitr now has for shiny and a variety of other fancier interactive R features in R Markdown and Rnw files.

It’s still pretty cool to be able to see the output & change the code together in Jupyter, though. First, I recommend checking out the demo on the Jupyter web site. You can then click on the
“Welcome to R” demo and play around with the code:

To update the output from a compute cell, you can press CTRL + ENTER or simply press the ▶ on the upper toolbar.  Note that Jupyter also has nice Tab-completion, menus, and a host of other features! You can also upload or start your own files.

But, this is just a temporary server.  How can you bring Jupyter into your workflow?  You can install Jupyter and the R Kernel locally on your computer, which is relatively straightforward on Linux but looks like a pain on any other OS.
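For the Linux route, the rough outline is to install Jupyter via pip and then register the R kernel from inside R. This is only a sketch; the IRkernel packages have been moving quickly, so follow the IRkernel project's own instructions if they differ:

# In a shell first (assumes Python and pip are available): pip install jupyter

# Then, in R, install the IRkernel package from GitHub and register it
install.packages("devtools")
devtools::install_github("IRkernel/IRkernel")
IRkernel::installspec()   # makes "R" show up as a kernel in the notebook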

Another way: there are a few commercial cloud services that host Jupyter notebooks, and the one I like right now is SageMathCloud. The Sage Math project has come a long way on its goals of developing open-source math and collaboration tools, and I see a host of fascinating features on this app, such as course management tools for classroom use, collaborative editing and live video chat alongside a project & file system for managing Jupyter notebooks and Sage Math worksheets, and the ability to open multiple notebooks in different tabs:

 

I have been startled both by the robustness and activity of Jupyter’s development and by how little discussion there has been in the R community. Let’s get into Jupyter and see what we can create, and how we can shape the future of this highly promising tool for collaborative, reproducible data science & statistics.

To leave a comment for the author, please follow the link and comment on his blog: Statisfactions: The Sounds of Data and Whimsy » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

DO Something Nifffty with R

Fri, 2015-06-19 19:07

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

@briandconnelly (of pushoverr fame) made a super-cool post about connecting R to @IFTTT via IFTTT’s “Maker” channel. The IFTTT Maker interface to receive events is fairly straightforward and Brian’s code worked flawlessly, so it was easy to tweak a bit and wrap into a package.

To get started, you can clone an example public IFTTT recipe that posts data to a Dropbox folder. All you need to do is add the IFTTT channel to your account and add the API key to your .Renviron file, which will look something like:

IFTTT_API_KEY=uGlySTringOfCharACTers-

Sticking it in the .Renviron file is a nice platform-independent way of getting the variable in your environment (and you can set a similar environment variable in Travis for automated testing). NOTE that you’ll need to restart your R/RStudio session for that value to become available (what, you don’t have RStudio up all the time?).
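Once the session has been restarted, a quick sanity check (plain base R, assuming the variable name above) confirms that the key is visible:

# Returns "" when the variable is not set in the current session
Sys.getenv("IFTTT_API_KEY")

# or fail fast inside a script
if (!nzchar(Sys.getenv("IFTTT_API_KEY"))) stop("IFTTT_API_KEY is not set")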

Once you have the key in place and the IFTTT recipe setup, all you have to do is:

devtools::install_github("hrbrmstr/nifffty")
library(nifffty)
maker("rtest", "this", "is a", "test")

and you’ll have a file named something like june_19__2015_at_0927am.txt in a Dropbox folder that should have values like:

Value 1: this
Value 2: is a
Value 3: test

The potential IFTTT integrations are not limited to Dropbox (or iOS/Android notification as in Brian’s example). You can literally trigger anything from R in the IFTTT catalogue. Go foRth and automate!

DOing even MoRe

As stated, Brian’s code was complete, so making a package to post to IFTTT’s Maker channel was really easy. But, this is R and if we can get data to it, things can be even cooler. So, I made a receiver function that will catch Maker channel web requests and do stuff with them. Since I have an Apple Watch, I decided I wanted to test the interface out with IFTTT’s DO button and have R receive my coordinates whenever I press this:

(That’s an actual screen capture from the Apple Watch).

If you have both R and an Apple Watch, you can create a DO app as such (NOTE that DO apps work fine on the iPhone without an Apple Watch):

That will POST a bit of JSON to whatever is listening for it. To get R to listen to it, you need to build a small script that I’ll call listen.R, create a function to process the data and register it with the instance of the Rook web server that it will automagically create for you. You can specify which IP address and/or port to listen on as well. NOTE that this R server must be reachable from the internet, so you may need to do port forwarding if behind a NAT firewall.

My sample listen.R looks like this:

library(nifffty)

# not the best name for a function but it works
do_it <- function(req) {
  require(jsonlite)
  print(fromJSON(names(req$POST())))
  writeLines(names(req$POST()), "/tmp/bob.txt")
}

# you can change the port to match what you put into IFTTT
rcvr <- receiver(port=10999, handler=do_it)

# this is not necessary but it's a good diagnostic for a test script
print(rcvr)

# keeps the script alive
while (TRUE) Sys.sleep(24 * 60 * 60)

You can run that from a command-line (I tested it on Ubuntu) as such:

bob@server:~$ Rscript listen.R
Loading required package: methods
Loading required package: Rook
Server started on 0.0.0.0:10999
[1] nifffty http://0.0.0.0:10999/custom/nifffty

Call browse() with an index number or name to run an application.

When you tap the DO button on the Apple Watch, you’ll see the following in the console:

Loading required package: jsonlite

Attaching package: ‘jsonlite’

The following object is masked from ‘package:utils’:

    View

$latitude
[1] 43.25931

$longitude
[1] -70.80062

$timestamp
[1] "June 19, 2015 at 04:45PM"

and the following in the file it was instructed to write to:

bob@server:~$ cat /tmp/bob.txt
{
  "latitude": 43.2593566552207,
  "longitude": -70.8004647307757,
  "timestamp": "June 19, 2015 at 04:46PM"
}

JSON is a good choice given how easy it is to work with in R and you can have R do anything you can dream up with the data that’s sent to it. You can connect any supported IFTTT component and hook any variable from said component into the JSON that R will receive.

Coming Soon

The next step is to make a Shiny interface so it can receive and process real-time events. I’m hoping to make a sample Shiny app that will update my location on a map every time I tap the DO button and also compute some stats for it. I also need to put some polish on the receiver interface (add logging and some error checking).

If you have cool ideas or have hooked R up to IFTTT’s Maker channel to do something cool, drop a note in the comments and definitely submit any pull requests or issues on github.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Segmentation of methylation profiles using methylKit

Fri, 2015-06-19 17:58

(This article was first published on Recipes, scripts and genomics, and kindly contributed to R-bloggers)

methylKit is an R package for DNA methylation analysis and annotation using high-throughput bisulfite sequencing data. We recently included a new function in methylKit called methSeg(). The function can segment methylation profiles (methylRaw objects) or differential methylation profiles (methylDiff objects). Currently the function is available on a development branch. You can install the version of methylKit that has the segmentation functionality as shown below. We will move it to the master branch once we have more feedback on the function.
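As a rough sketch of that installation (the repository is al2na/methylKit on GitHub, but the branch name below is only a placeholder; check the post or the repository for the actual development branch):

# Install the development version of methylKit that contains methSeg()
# NOTE: "development" is a placeholder branch name, not confirmed here
install.packages("devtools")
devtools::install_github("al2na/methylKit", ref = "development")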

Rationale



Regulatory regions can be discovered by analyzing methylation patterns. One popular approach is to segment methylomes into distinct categories, such as regions with low or high methylation. The most popular method for categorical methylome segmentation is based on hidden Markov models (HMM) (an example is here). Another segmentation strategy is to use variants of change-point analysis, where change points in a signal across the genome are extracted and the genome is segmented into regions between consecutive change points. These methods are typically used in CNV (copy-number variation) detection but have applications in the methylation context as well.

In the context of methylation, we first extract segments between change points of the signal (the methylation profile), then cluster those segments into groups based on their mean methylation values. In this case, a segment is a region of the genome where CpGs have similar methylation values. We cluster those segments into segment groups using mixture modeling. The segment groups correspond, for example, to highly or lowly methylated regions. Since this is an unsupervised method, we can find more patterns than just highly or lowly methylated regions. Users then have to overlap the segments with other marks and annotation, such as promoters, H3K27ac or other histone modifications, to associate segment groups with functional annotation.

Use case and example

Below, we are using methylation data from the Roadmap Epigenomics H1 cell line. The code demonstrates how to use the function and export the results to a BED file (a sketch is given below). Similarly, we can apply this function on differential methylation profiles, where values range between -100 and 100. Do help(methSeg) to see the help page for the function.
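A rough sketch of that use case, assuming a methylRaw object called my.methylRaw has already been created from the H1 data (the object name and the export helper are assumptions; check help(methSeg) for the actual interface):

library(methylKit)

# Segment the methylation profile; the result holds the segments
# and the segment group each one was assigned to
seg.res <- methSeg(my.methylRaw)

# Export the segments to a BED file so they can be overlapped with
# annotation such as promoters or H3K27ac peaks
# (methSeg2bed is assumed here as the companion export function)
methSeg2bed(seg.res, filename = "H1_meth_segments.bed")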

To leave a comment for the author, please follow the link and comment on his blog: Recipes, scripts and genomics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R 3.2.1 is released

Fri, 2015-06-19 12:53

(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

R 3.2.1 (codename “World-Famous Astronaut”) was released yesterday. You can get the latest binaries from here (or the .tar.gz source code from here). The full list of new features and bug fixes is provided below.

Upgrading to R 3.2.1 on Windows

If you are using Windows you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:

install.packages("installr") # install installr::updateR() # updating R.

Running “updateR()” will detect whether a new R version is available, and if so it will download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the installr package.

I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.

CHANGES IN R 3.2.1:

 

NEW FEATURES
  • utf8ToInt() now checks that its input is valid UTF-8 and returns NA if it is not.
  • install.packages() now allows type = "both" with repos = NULL if it can infer the type of file.
  • nchar(x, *) and nzchar(x) gain a new argument keepNA which governs how the result for NAs in x is determined. For the R 3.2.x series, the default remains FALSE which is fully back compatible. From R 3.3.0, the default will change to keepNA = NA and you are advised to consider this for code portability.
  • news() more flexibly extracts dates from package ‘NEWS.Rd’ files.
  • lengths(x) now also works (trivially) for atomic x and hence can be used more generally as an efficient replacement of sapply(x, length) and similar.
  • The included version of PCRE has been updated to 8.37, a bug-fix release.
  • diag() no longer duplicates a matrix when extracting its diagonal.
  • as.character.srcref() gains an argument to allow characters corresponding to a range of source references to be extracted.
BUG FIXES

  • acf() and ccf() now guarantee values strictly in [-1,1] (instead of sometimes very slightly outside). PR#15832.
  • as.integer("111111111111") now gives NA (with a warning) as it does for the corresponding numeric or negative number coercions. Further, as.integer(M + 0.1) now gives M (instead of NA) when M is the maximal representable integer.
  • On some platforms nchar(x, "c") and nchar(x, "w") would return values (possibly NA) for inputs which were declared to be UTF-8 but were not, or for invalid strings without a marked encoding in a multi-byte locale, rather than give an error. Additional checks have been added to mitigate this.
  • apply(a, M, function(u) c(X = ., Y = .)) again has dimnames containing “X” and “Y” (as in R < 3.2.0).
  • (Windows only) In some cases, the --clean option to R CMD INSTALL could fail. (PR#16178)
  • (Windows only) choose.files() would occasionally include characters from the result of an earlier call in the result of a later one. (PR#16270)
  • A change in RSiteSearch() in R 3.2.0 caused it to submit invalid URLs. (PR#16329)
  • Rscript and command line R silently ignored incomplete statements at the end of a script; now they are reported as parse errors. (PR#16350)
  • Parse data for very long strings was not stored. (PR#16354)
  • plotNode(), the workhorse of the plot method for "dendrogram"s is no longer recursive, thanks to Suharto Anggono, and hence also works for deeply nested dendrograms. (PR#15215)
  • The parser could overflow internally when given numbers in scientific format with extremely large exponents. (PR#16358)
  • If the CRAN mirror was not set, install.packages(type = "both") and related functions could repeatedly query the user for it. (Part of PR#16362)
  • The low-level functions .rowSums() etc. did not check the length of their argument, so could segfault. (PR#16367)
  • The quietly argument of library() is now correctly propagated from .getRequiredPackages2().
  • Under some circumstances using the internal PCRE when building R from source would cause external libs such as -llzma to be omitted from the main link.
  • The .Primitive default methods of the logic operators, i.e., !, & and |, now give correct error messages when appropriate, e.g., for `&`(TRUE) or `!`(). (PR#16385)
  • cummax(x) now correctly propagates NAs also when x is of type integer and begins with an NA.
  • summaryRprof() could fail when the profile contained only two records. (PR#16395)
  • HTML vignettes opened using vignette() did not support links into the rest of the HTML help system. (Links worked properly when the vignette was opened using browseVignettes() or from within the help system.)
  • arima(*, xreg = .) (for d >= 1) computes estimated variances based on the number of effective observations as in R version 3.0.1 and earlier. (PR#16278)
  • slotNames(.) is now correct for "signature" objects (mostly used internally in methods).
  • On some systems, the first string comparison after a locale change would result in NA.
 

To leave a comment for the author, please follow the link and comment on his blog: R-statistics blog » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs