R bloggers

R news and tutorials contributed by (552) R bloggers
Updated: 6 hours 30 min ago

Buster – a new R package for bagging hierarchical clustering

Wed, 2014-07-09 06:06

(This article was first published on Drunks&Lampposts » R, and kindly contributed to R-bloggers)

I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.

It’s not a typical everyday human dilemma. It needs a bit more explanation.

Some background

Clustering (assuming everyone is happy with this technique but if not click here) typically works on a matrix of distances between data points. We sometimes refer to the distances as dissimilarities – the greater the distance the more dissimilar the data points. So in a simple case the data points might be customers and the distances reflect how different the customers are in terms of demographics, purchasing behaviour etc.

A simple way to achieve this would be to plot your data points against a set of axes representing the things you would like to include in the dissimilarity measure and then just measure the distance between the points. Scaling the data would ensure that none of the attributes are prioritised over the others. For example plot your customers by age and income (both standardised) and then measure the distance between them to determine how similar they are in age and income.
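
As a toy illustration (my own, not from the original post), the base functions scale() and dist() do exactly this:

# Euclidean distances between customers on standardised variables
customers <- data.frame(age = c(25, 40, 62), income = c(28000, 55000, 41000))
d <- dist(scale(customers))  # scale() standardises each column, dist() measures Euclidean distance
d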

This is Euclidean distance and it is implicit in many popular and powerful clustering algorithms, for example k-means and its variants.

But how do you know if your measure of dissimilarity is representable as a Euclidean distance and therefore amenable to k-means? It’s simple enough if you started with some variables and derived your Euclidean distance, but it doesn’t always work this way. For example, suppose I am measuring similarity by correlation. Is the absolute value of correlation representable as a Euclidean distance? Is there some n-dimensional space where we could plot our data points such that the distance between them represented the absolute value of their correlation?

A quick check to see if your measure is ever going to be representable as euclidean distance is: does it satisfy the triangle inequality?

\(d(x,z) \leq d(x,y) + d(y,z)\)

where \(d(x,y)\) is the distance between x and y.

For example love and hate (assuming they are quantifiable) do not satisfy the triangle inequality. If x loves y and y loves z this places no constraint on the degree to which x loves z. It could easily be more than the sum of the love of x for y and y for z!
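
As a rough sketch (again mine, not from the original post), here is one way to check the triangle inequality for a dissimilarity matrix in R; the absolute-correlation example at the end is only illustrative:

# returns TRUE if d(x, z) <= d(x, y) + d(y, z) holds for every triple of points
satisfiesTriangle <- function(d, tol = 1e-8) {
  d <- as.matrix(d)
  for (j in seq_len(nrow(d))) {
    # outer(d[, j], d[j, ], "+")[i, k] equals d(i, j) + d(j, k), i.e. the route via point j
    if (any(d > outer(d[, j], d[j, ], "+") + tol)) return(FALSE)
  }
  TRUE
}

# e.g. does 1 minus absolute correlation pass this necessary condition?
satisfiesTriangle(1 - abs(cor(mtcars)))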

Continue reading at my new professional blog coppelia.io

To leave a comment for the author, please follow the link and comment on his blog: Drunks&Lampposts » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Recording of OpenCPU talk at #useR2014

Wed, 2014-07-09 03:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

A recording of the useR! 2014 presentation about OpenCPU is now available on Youtube. This talk gives a brief (20 minute) motivation and introduction to some of the high level concepts of the OpenCPU system. The video contains mostly screen recording, mixed with some AV footage provided by Timothy Phan (thanks!).

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Can Rcpp fuse ?

Tue, 2014-07-08 20:33

(This article was first published on R Enthusiast and R/C++ hero, and kindly contributed to R-bloggers)

One of the features of Rcpp11 people seemed to like during useR is the fuse function. fuse is somewhat similar to the c function in R.

For the purpose of this post, let's simplify what fuse does, and just say that it takes several compatible vectors and fuses them together into one.

// some vectors
NumericVector a = {1.0, 2.3}, b = {2.5, 5.7}, c = {4.2, 4.1, 1.4};
// fuse them into a vector of size 7
NumericVector d = fuse(a, b, c);

One nice thing is that it also handles scalars, e.g :

// some vectors
NumericVector a = {1.0, 2.3}, c = {4.2, 4.1, 1.4};
// fuse them into a vector of size 6
NumericVector d = fuse(a, 3.5, c);

So people seem to like that and I don't blame them, that's a cool feature. Then usually what happens is that I get:

That's a cool feature you have in Rcpp11. Can I use it in Rcpp too ?

I have mixed feelings about that kind of question. It is nice that people acknowledge that this is a nice feature and they want to use it, but at the same time I'd like people to stop thinking that Rcpp11 is some kind of a lab for things that will go later into Rcpp. Some might, but it is unlikely that I will participate in that effort. I'd rather help people migrating to Rcpp11.

Conceptually, fuse is really simple. For each argument, it figures out if this is a scalar or a vector, retrieves the size of the vector to create, creates it, and then loops around to copy data from the inputs to the output. So fuse is just a glorified set of for loops.

In Rcpp11 because we can assume C++11, this can be written using variadic templates, i.e. template functions taking a variable number of arguments. Rcpp has to remain compatible with C++98, so if we wanted to port fuse to Rcpp, we would have to ship a C++98 compatible version. One overload for 2 parameters, one for 3 parameters, one for 4 parameters, ...

This at least brings two problems to the table:

  • Code bloat. This would be one of these features that add thousands of lines of code. This is hard to maintain and slows the compiler down yet again.

  • Where to stop? NumericVector::create arbitrarily stops at 20 parameters. Of course fusing 20 vectors is going to be enough for most use cases, but on occasions people will feel the desire to go beyond, and the compiler will welcome them with lovely multi page errors.

To leave a comment for the author, please follow the link and comment on his blog: R Enthusiast and R/C++ hero. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Are Consumer Preferences Deep or Shallow?

Tue, 2014-07-08 18:49

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

John Hauser, because no one questions his expertise, is an excellent spokesperson for the viewpoint that consumer preferences are real, as presented in his article "Self-Reflection and Articulated Consumer Preferences." Simply stated, preferences are enduring when formed over time and after careful consideration of actual products. As a consequence, accurate measurement requires us to encourage self-reflection within realistic contexts. "Here true preferences mean the preferences consumers use to make decisions after a serious evaluation of the products that are available on the market."

However, serious evaluation takes some time and effort, in fact, a series of separate online tasks including revealed preference plus self-reports of both attribute-level preferences and decision-making strategies. We end up with a lot of data from each respondent enabling the estimation of a number of statistical models (e.g., a hierarchical Bayes choice-based conjoint that could be fit using the bayesm R package). All this data is deemed necessary in order for individuals to learn their "true" preferences. Underlying Hauser's approach is a sense of inevitability that a decision maker will arrive at the same resolution regardless of their path as long as they begin with self-reflection.

A more constructivist alternative can be found in my post on "The Mind is Flat!" where it is argued that we lack the cognitive machinery to generate, store and retrieve the extensive array of enduring preferences demanded by utility theory. Although we might remember our previous choice and simply repeat it as a heuristic simplification strategy, working our way through the choice processes anew will likely result in a different set of preferences. Borrowing a phrase from Stephen Jay Gould, replaying the "purchase process tape" will not yield the same outcome unless there are substantial situational constraints forcing the same resolution.
Do preferences control information search, or are preferences evoked by the context? Why would we not expect decision making to be adaptive and responsive to the situation? Enduring preferences may be too rigid for our constantly changing marketplaces. Serendipity has its advantages. After the fact, it is easy to believe that whatever happened had to be. Consider the case study from Hauser's article, and ask what if there had not been an Audi dealership near Maria? Might she have been just as satisfied or perhaps even happier with her second choice? It all works out for the best because we are inventive storytellers and cognitive dissonance will have its way. Isn't this the lesson from choice blindness?

Still, most of marketing research continues to believe in true and enduring preferences that can be articulated by the reflective consumer even when confronted by overwhelming evidence that the human brain is simply not capable of such feats. We recognize patterns, even when they are not there, and we have extensive episodic memory for stuff that we have experienced. We remember faces and places, odors and tastes, and almost every tune we have ever heard, but we are not as proficient when it comes to pin numbers and passwords or dates or even simple calculations. Purchases are tasks that are solved not by looking inside for deep and enduring preferences. Instead, we exploit the situation or task structure and engage in fast thinking with whatever preferences are elicited by the specifics of the setting. Consequently, preferences are shallow and contextual.

As long as pre-existing preferences were in control, we were free to present as many alternatives and feature-levels as we wished. The top-down process would search for what it preferred and the rest would be ignored. However, as noted above, context does matter in human judgment and choice. Instead of deciding what you feel like eating (top down), you look at the menu and see what looks good to you (bottom up). Optimal experimental designs that systematically manipulate every possible attribute must be replaced by attempts to mimic the purchase context as closely as possible, not just the checkout but the entire consumer decision journey. Purchase remains the outcome of primary interest, but along the way attention becomes the dependent variable for "a wealth of information creates a poverty of attention" (Herbert A. Simon).

Future data collection will have us following consumers around real or replicated marketplaces and noting what information was accessed and what was done. Our statistical model will then be forced to deal with the sparsity resulting from consumers who concentrate their efforts on only a very few of the many touchpoints available to them. My earlier post on identifying the pathways in the consumer decision journey will provide some idea of what such an analysis might look like. In particular, I show how the R package nmf is able to uncover the underlying structure when the data matrix is sparse. More will follow in subsequent posts.
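
For readers who want to experiment with that idea, here is a minimal, hypothetical sketch using the NMF package on CRAN; the sparse "journey" matrix below is simulated, not the consumer data discussed in the post:

library(NMF)
set.seed(1)
# made-up journey data: 200 consumers x 20 touchpoints, mostly zeros
V <- matrix(rpois(200 * 20, lambda = 0.3), nrow = 200)
V <- V[rowSums(V) > 0, ]   # drop consumers with no recorded touchpoints
res <- nmf(V, rank = 3)    # look for 3 latent pathways
basis(res)                 # consumer loadings on each pathway
coef(res)                  # touchpoint profile of each pathway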



To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Speed Tests for Rolling/Running Functions

Tue, 2014-07-08 14:49

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

I use rolling and running functions almost daily with financial time series. In my post A Whole New World with Chains and Pipes, I made this statement: "I have noticed that rolling analysis with xts can sometimes be slow. as.matrix is my favorite way to speed things up, since I usually do not need xts' powerful indexing and subsetting features." I felt like I should be a little more thorough, so I put…
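
The benchmark code itself is not included in this excerpt, but a minimal comparison along those lines might look like the sketch below (the series is simulated and timings will vary by machine):

library(xts)
set.seed(1)
# a made-up daily series, roughly ten years long
x <- xts(cumsum(rnorm(2520)), order.by = Sys.Date() - 2520:1)

system.time(roll_xts <- rollapply(x, width = 60, FUN = sd))            # rolling over the xts object
system.time(roll_mat <- rollapply(as.matrix(x), width = 60, FUN = sd)) # rolling over a plain matrix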

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R Day at Strata NYC

Tue, 2014-07-08 12:50

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio will teach the new essentials for doing data science in R at this year’s Strata NYC conference, Oct 15 2014.

R Day at Strata is a full day of tutorials that will cover some of the most useful topics in R. You’ll learn how to manipulate and visualize data with R, as well as how to write reproducible, interactive reports that foster collaboration. Topics include:

9:00am – 10:30am
A Grammar of Data Manipulation with dplyr
Speaker: Hadley Wickham

11:00am – 12:30pm
A Reactive Grammar of Graphics with ggvis
Speaker: Winston Chang

1:30pm – 3:00pm
Analytic Web Applications with Shiny
Speaker: Garrett Grolemund

3:30pm – 5:00pm
Reproducible R Reports with Packrat and Rmarkdown
Speaker: JJ Allaire & Yihui Xie

The tutorials are integrated into a cohesive day of instruction. Many of the tools that we’ll cover did not exist six months ago, so you are almost certain to learn something new. You will get the most out of the day if you already know how to load data into R and have some basic experience visualizing and manipulating data.

Visit strataconf.com/stratany2014 to learn more and register! Early bird pricing ends July 31.

Not available on October 15? Check out Hadley’s Advanced R Workshop in New York City on September 8 and 9, 2014.

 


To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

DSC 2014, Day 2

Tue, 2014-07-08 12:16

(This article was first published on JAGS News » R, and kindly contributed to R-bloggers)

This will be somewhat shorter summary of the second day of DSC 2014. There was a wider variety of presentations today, but I am only going to discuss those that touched on the main theme of the day, which was reproducibility.

Subscribers to the R-devel mailing list will recall a long discussion started in March by Jeroen Ooms called A case for freezing CRAN. Jeroen described the difficulties reproducing previous analyses (either those done by other people or done by oneself at a previous time) due to changes introduced in R packages. Jeroen proposed that a frozen version of CRAN be made available for each R release so that all users with the same version of R would have access to the same set of packages. Several presentations at DSC directly addressed this challenge.

Simon Urbanek (AT&T) presented RCloud, an environment for collaboratively sharing R scripts. Data and code are pushed into the cloud so that the user does not care where it is stored. All user work is contained in a “notebook” which is automatically under version control. Notebooks can be shared by simply sharing a URL with another RCloud user.

JJ Allaire presented packrat, a dependency management system for R packages that essentially creates a separate library for each project. Packrat creates isolated, portable and reproducible environments for R packages, and does not require any changes to CRAN.
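
For readers who have not seen packrat, a minimal sketch of the workflow looks roughly like this (the project path is a placeholder, and this is my illustration rather than material from the talk):

install.packages("packrat")
packrat::init("~/projects/my-analysis")  # give the project its own private package library
install.packages("dplyr")                # now installs into the project library, not the user library
packrat::snapshot()                      # record the exact package versions in use
packrat::restore()                       # later, or on another machine: rebuild that library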

Michael Lawrence presented GRAN, a package management system, but my notes don't allow me to say much about it here and the slides are not on the web site to jog my memory.

David Smith from Revolution Analytics presented the Reproducible R Toolkit (RRT) and its server-side counterpart the Marmoset R Archive Network (MRAN). This is the solution that most closely resembles Jeroen's frozen CRAN. David suggested that it would be more useful to ordinary R users, as opposed to R developers who may prefer the approach taken by packrat or GRAN.

Special mention must go to Romain Francois, who drove through the night from the Rencontres R conference in Montpellier to Brixen to give a talk about Rcpp11, then went straight on to Paris to fly out to the UseR! conference in California. I told Romain he was crazy.

Overall this was an excellent meeting. It was very important to get together people who had been communicating by email only for a long period and I was impressed at the increasing investment in R by companies.


To leave a comment for the author, please follow the link and comment on his blog: JAGS News » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Dependencies of popular R packages

Tue, 2014-07-08 11:48

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

With the growing popularity of R, there is an associated increase in the popularity of online forums to ask questions. One of the most popular sites is StackOverflow, where more than 60 thousand questions have been asked and tagged as related to R.

On the same page, you can also find related tags. Among the top 15 tags associated with R, several are also packages you can find on CRAN:

  • ggplot2
  • data.table
  • plyr
  • knitr
  • shiny
  • xts
  • lattice

It is very easy to install these packages directly from CRAN using the R function install.packages(), but this will also install all of these packages' dependencies.

This leads to the question: How can one determine all these dependencies?

It is possible to do this using the function available.packages() and then query the resulting object.
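
For instance, a base-R sketch of that approach (mine, not from the post) could combine available.packages() with tools::package_dependencies():

db <- available.packages(contriburl = contrib.url("http://cran.revolutionanalytics.com"))
deps <- tools::package_dependencies(c("ggplot2", "data.table"), db = db, recursive = TRUE)
sort(unique(unlist(deps)))  # every package pulled in, directly or indirectly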

But it is easier to answer this question using the functions in a new package, called miniCRAN, that I am working on. I have designed miniCRAN to allow you to create a mini version of CRAN behind a corporate firewall. You can use some of the functions in miniCRAN to list packages and their dependencies, in particular:

  • pkgAvail()
  • pkgDep()
  • makeDepGraph()

I illustrate these functions in the following scripts.

Start by loading miniCRAN and retrieving the available packages on CRAN. Use the function pkgAvail() to do this:

library(miniCRAN)
pkgdata <- pkgAvail(repos = c(CRAN="http://cran.revolutionanalytics.com"), type="source")
head(pkgdata[, c("Depends", "Suggests")])
##             Depends                                  Suggests
## A3          "R (>= 2.15.0), xtable, pbapply"         "randomForest, e1071"
## abc         "R (>= 2.10), nnet, quantreg, MASS"      NA
## abcdeFBA    "Rglpk,rgl,corrplot,lattice,R (>= 2.10)" "LIM,sybil"
## ABCExtremes "SpatialExtremes, combinat"              NA
## ABCoptim    NA                                       NA
## ABCp2       "MASS"                                   NA

 

Next, use the function pkgDep() to get dependencies of the 7 popular tags on StackOverflow:

tags <- c("ggplot2", "data.table", "plyr", "knitr", "shiny", "xts", "lattice")
pkgList <- pkgDep(tags, availPkgs=pkgdata, suggests=TRUE)
pkgList
##  [1] "abind"        "bit64"        "bitops"       "Cairo"
##  [5] "caTools"      "chron"        "codetools"    "colorspace"
##  [9] "data.table"   "dichromat"    "digest"       "evaluate"
## [13] "fastmatch"    "foreach"      "formatR"      "fts"
## [17] "ggplot2"      "gtable"       "hexbin"       "highr"
## [21] "Hmisc"        "htmltools"    "httpuv"       "iterators"
## [25] "itertools"    "its"          "KernSmooth"   "knitr"
## [29] "labeling"     "lattice"      "mapproj"      "maps"
## [33] "maptools"     "markdown"     "MASS"         "mgcv"
## [37] "mime"         "multcomp"     "munsell"      "nlme"
## [41] "plyr"         "proto"        "quantreg"     "RColorBrewer"
## [45] "Rcpp"         "RCurl"        "reshape"      "reshape2"
## [49] "rgl"          "RJSONIO"      "scales"       "shiny"
## [53] "stringr"      "testit"       "testthat"     "timeDate"
## [57] "timeSeries"   "tis"          "tseries"      "XML"
## [61] "xtable"       "xts"          "zoo"

 

Wow, these 7 packages pull in a list of 63 packages in total (the 7 themselves plus 56 dependencies)!

You can graphically visualise these dependencies in a graph, by using the function makeDepGraph():

p <- makeDepGraph(pkgList, availPkgs=pkgdata)
library(igraph)

plotColours <- c("grey80", "orange")
topLevel <- as.numeric(V(p)$name %in% tags)

par(mai=rep(0.25, 4))

set.seed(50)
vColor <- plotColours[1 + topLevel]
plot(p, vertex.size=8, edge.arrow.size=0.5, vertex.label.cex=0.7,
     vertex.label.color="black", vertex.color=vColor)
legend(x=0.9, y=-0.9, legend=c("Dependencies", "Initial list"),
       col=c(plotColours, NA), pch=19, cex=0.9)
text(0.9, -0.75, expression(xts %->% zoo), adj=0, cex=0.9)
text(0.9, -0.8, "xts depends on zoo", adj=0, cex=0.9)
title("Package dependency graph")


So, if you wanted to install the 7 most popular R packages (according to StackOverflow), R will in fact download and install up to 63 different packages!

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

meteoForecast, a package to obtain NWP-WRF forecasts in R

Tue, 2014-07-08 10:37

(This article was first published on Omnia sunt Communia! » R-english, and kindly contributed to R-bloggers)

The Weather Research and Forecasting (WRF) Model is a numerical weather prediction (NWP) system. NWP refers to the simulation and prediction of the atmosphere with a computer model, and WRF is a set of software for this.

meteoForecast is a new R package that implements functions to download data from the Meteogalicia and OpenMeteo NWP-WRF services using the NetCDF Subset Service.

Read this introduction for additional information and examples.

To leave a comment for the author, please follow the link and comment on his blog: Omnia sunt Communia! » R-english. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

googleVis 0.5.3 released

Tue, 2014-07-08 02:50

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Recently we released googleVis 0.5.3 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R.

Screen shot of some of the Google Charts
Although this is mainly a maintenance release, I'd like to point out two changes:
  • Default chart width is set to 'automatic' instead of 500 pixels.
  • Intervals for column roles have to end with the suffix ".i", with "i" being an integer. Several interval columns are allowed; see the roles demo and vignette for more details.
Those changes were required to fix the following issues:
  • The order of y-variables in core charts wasn't maintained. Thanks to John Taveras for reporting this bug.
  • Width and height of googleVis charts were only accepted in pixels, although the Google Charts API uses standard HTML units (for example, '100px', '80em', '60', 'automatic'). If no units are specified the number is assumed to be pixels. This has been fixed. Thanks to Paul Murrell for reporting this issue.
New to googleVis? Review the demo on CRAN.
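
As a quick, hypothetical taste of the package (not part of the release notes), a basic interactive chart only takes a few lines:

library(googleVis)
df <- data.frame(year = 2010:2014, sales = c(3, 5, 4, 7, 8))
chart <- gvisLineChart(df, xvar = "year", yvar = "sales",
                       options = list(width = "automatic", height = 300))
plot(chart)  # opens the interactive chart in your browser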

To leave a comment for the author, please follow the link and comment on his blog: mages' blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

analyze the demographic and health surveys (dhs) with r

Tue, 2014-07-08 02:00

(This article was first published on asdfree by anthony damico, and kindly contributed to R-bloggers)

professors of public health 101 probably cite the results of the demographic and health surveys (dhs) more than all other data sources combined.  funded by the united states agency for international development (usaid) and administered by the technically-savvy analysts at icf international, this collection of multinational surveys enters its third decade as the authoritative source of international development indicators.  want a sampler of what that all means?  load up the dhs homepage and watch the statistics fly by: 70% of beninese children younger than five sleep under an insecticide-treated bednet / more than a third of kyrgyz kids aged 6-59 months have anemia / only 35% of guinean households have a place to wash yer hands.  this is the front-and-center toolkit for professional epidemiologists who want to know who/what/when/where/why to target a public health intervention in any of these nations.

before you read any more about the microdata, look at http://www.statcompiler.com/ first.  this online table creator might give you access to every statistic that you need, and without the fuss, muss, or missing values of a person-level table.  (bonus: click here to watch me describe dhs-style online table creation from a teleprompter.)  why should you use statcompiler?  because it's quick, easy, and has aggregated statistics for every country at your fingertips.

if that doesn't dissuade you from digging into an actual data set, one more point of order: you'll likely only be given access to a small number of countries.  so when applying for access, it'd be smart to ask for whichever country you are interested in _and also_ for malawi 2004.  that way, you will be able to muck around with my example syntax using the data tables that they were intended for.  if you have already registered, no fear: you can request that malawi be added to your existing project.  i tried requesting every data set.  i failed.  the data archivists do not grant access to more than a country or two, so choose what countries you need sparingly.  while semi-restricted, this is public data:  so long as you have a legitimate research question, you'll be granted access without cost.  this new github repository contains three scripts:


download and import.R

analysis examples.R

replication.R
  • load the 2004 malawi individual recodes file into working memory
  • re-create some of the old school-style strata described in this forum
  • match a single row from pdf page 324 all the way across, deft and all.



click here to view these three scripts
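
to give a flavor of what the analysis scripts boil down to, here's a minimal, hypothetical sketch of the usual dhs complex-sample setup with the survey package.  the data frame below is simulated; real recode files typically use these standard variable names, but check your own file before trusting them:

library(survey)
# made-up stand-in for an individual recodes file
dhs_df <- data.frame(
  v021 = rep(1:50, each = 20),                  # primary sampling unit (cluster)
  v022 = rep(1:10, each = 100),                 # sample strata
  v005 = sample(5e5:2e6, 1000, replace = TRUE), # sample weight, stored multiplied by one million
  v012 = sample(15:49, 1000, replace = TRUE)    # respondent age
)
dhs_df$weight <- dhs_df$v005 / 1e6
dhs_design <- svydesign(id = ~v021, strata = ~v022, weights = ~weight, data = dhs_df, nest = TRUE)
svymean(~v012, dhs_design)                      # weighted mean age with design-based standard errors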



for more detail about the demographic and health surveys (dhs), visit:


notes:

next to the main survey microdata set, you'll see some roman numerals ranging from one through six.  this number indicates which version of the survey manual that particular dataset corresponds to.  different versions have different questions, structures, and microdata files: read the entire "general description" section (only about ten pages) of the manual before you even file your request for data access.

these microdata are complex, confusing, occasionally strangely-coded, and often difficult to reconcile with historical versions.  (century month codes? wowza.)  that's understandable, and the survey administrators deserve praise for keeping everything as coherent as they have after thirty years of six major questionnaire revisions of ninety countries of non-english-speaking respondents across this crazy planet of ours.  if you claw through the documentation and cannot find an explanation, you'll want to engage the user forum.  they are thoroughly responsive, impressively knowledgeable, and will help you get to the bottom of it - whatever `it` may be.  before you ask a question here, or really anywhere in life, have a solid answer to whathaveyoutried.  and for heavens' sakes,* prepare a reproducible example for them.

* my non-denominational way of saying heaven's sake.


confidential to sas, spss, stata, and sudaan users: i would shake your hand but you've yet to adopt the statistical equivalent of coughing into your sleeve.  time to transition to r.  :D

To leave a comment for the author, please follow the link and comment on his blog: asdfree by anthony damico. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Reflections on useR! 2014

Mon, 2014-07-07 19:01

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

UseR! 2014, the R user conference held last week in LA, was the most successful yet. Around 700 R users from around the world converged on the UCLA campus to share their experiences with the R language and to socialize with other data scientists, statisticians and others using R.

The week began with a series of 3-hour tutorials on topics as diverse as data management, visualization, statistics and biostatistics, programming, and interactive applications. (Joe Rickert reported on the tutorials from the field last week.) Unlike in previous years, the tutorial day was included in the registration, which meant that all of the sessions were jam-packed with interested R users.

The remaining 2-and-a-half days were packed with keynotes, contributed talks, and social sessions galore. Given the parallel nature of the tracks I couldn't make it to more than a fraction of the talks, but here are a few of the highlights from my notes:

  • In the opening keynote, John Chambers shared the story of the genesis of the S language (which later begat R — see more in this 2013 interview). The three key principles behind the S language were objects (everything is an object), functions (everything that happens is a function call), and interfaces (the language is an interface to other algorithms). In fact, the very first sketch of the language (from a slide made in 1976, shown below) described it as an "Algorithm Interface".

  • Jeroen Ooms demonstrated OpenCPU, a Web API for scientific computing. OpenCPU makes it possible to integrate R into web-based apps for non R users. You can see some examples of OpenCPU in action in the App Gallery, including the Stockplot app that Jeroen demonstrated.
  • Karthik Ram talked about the ROpenSci project and fostering open science with R. The ROpenSci team has created dozens of R packages, including interfaces to public data sources, data visualization tools, and support for reproducible research. 
  • Google has more than 1000 R users on its internal R mailing list, according to Tim Hesterberg in Google's sponsor talk. (My sponsor talk for Revolution Analytics can be found here.)
  • The heR panel discussion and mixer, which facilitated an excellent conversation about women in data science and the R community. (The useR! conference itself was around 25% women — certainly room for improvement, but better than many math or computer science conferences.)

  • Thomas Fuchs from NASA/JPL, who revealed in two talks that R is used for vision analysis in space exploration (including the Mars HiRISE mission and deep-space astronomy). As a NASA buff this was a thrill for me to learn, and I hope to write more about it sometime.
  • The conference banquet under the summer twilight on the UCLA lawns, featuring entertaining anecdotes from David McArthur.

For more on the conference, the slides from many of the talks and tutorials are available at the useR! website. (If you presented there, please submit your slides via a pull request.) Also check out these reviews of the conference from Daniel Gutierrez at InsideBigData and Phyllis Zimbler Miller.

The next conference, useR! 2015, will be held in Aalborg, Denmark. I'm looking forward to it already!

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Chillin’ at UseR! 2014

Mon, 2014-07-07 15:37

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

This year’s UseR! conference was held at the University of California in Los Angeles. Despite the great weather and a nearby beach, most of the conference was spent in front of projector screens in 18°C (64°F) rooms because there were so many interesting presentations and tutorials going on. I was lucky to present my R package Bayesian First Aid and the slides can be found here:

There was so much great stuff going on at UseR! and here follows a random sample:

All in all, a great conference! I’m already looking forward to next year’s UseR! conference, which will be held at Aalborg University, not too far from where I live (at least compared to LA).

To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Sometimes Table is not the Answer – a Faster 2×2 Table

Mon, 2014-07-07 15:28

(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

The table command is great in its simplicity for cross tabulations. I have run into some settings where it is slow and I wanted to demonstrate one simple example here of why you may want to use other functions or write your own tabler. This example is a specific case where, for some examples and functions, you don't need all the good error-checking or flexibility that a function contains, but you want to do something specific and can greatly speed up computation.

Setup of example

I have some brain imaging data. I have a gold standard, where an expert hand-traced (on a computer) a brain scan delineating the brain. I'll refer to this as a brain “mask”. (We use the word mask in imaging to denote a segmented image – either done manually or automatically and I generally reserve the word mask for binary 0/1 values, but others use the term more broadly.)

Using automated methods, I can try to re-create this mask automatically. This image is also binary. I want to simply get a \(2\times2\) contingency table of the automated versus manual masks so I can get sensitivity/specificity/accuracy/etc.

The data

For simplicity and computation, let's consider the images as just a really long vectors. I'll call them manual and auto for the manual and automatic masks, respectively.

These are long logical vectors (9 million elements):

length(manual)
[1] 9175040
length(auto)
[1] 9175040
head(manual)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(auto)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Naturally, you can run table on this data:

stime = system.time({
  tab = table(manual, auto)
})
print(stime)
   user  system elapsed
  3.294   0.082   3.386
print(tab)
        auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

The computation took about 3.4 seconds on my MacBook Pro (2013, 16Gb RAM, 2.8GHz Intel i7), which isn't that bad. Realize, though, that I could have hundreds or thousands of these images. We need to speed this up.

What is the essence of what we're doing?

Taking 3.4 seconds to get 4 numbers seems a bit long. As the data is binary, we can simply compute these with the sum command and logical operators.

Let's make the twoXtwo command:

twoXtwo = function(x, y, dnames=c("x", "y")){
  tt <- sum( x &  y)
  tf <- sum( x & !y)
  ft <- sum(!x &  y)
  ff <- sum(!x & !y)
  tab = matrix(c(ff, tf, ft, tt), ncol=2)
  n = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(n) = dnames
  dimnames(tab) = n
  tab = as.table(tab)
  tab
}

And let's see how fast this is (and confirm the result is the same):

stime2 = system.time({
  twotab = twoXtwo(manual, auto, dnames=c("manual", "auto"))
})
print(stime2)
   user  system elapsed
  0.828   0.006   0.835
print(twotab)
        auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab)
[1] TRUE

Voila, twoXtwo runs about 4.06 times faster than table, largely because we knew we did not have to check certain characteristics of the data and that it's a specific example of a table.

More speed captain!

This isn't something astronomical such as a 100-fold increase, but we can increase the speed by not doing all the logical operations on the vectors, but taking differences from the margin sums.
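
The definition of twoXtwo2 is not included in this excerpt; a sketch of that margin-sum idea (my reconstruction, not necessarily the author's exact code) could look like this:

twoXtwo2 = function(x, y, dnames=c("x", "y")){
  tt <- sum(x & y)               # the only vector-wide logical operation
  tf <- sum(x) - tt              # x TRUE,  y FALSE
  ft <- sum(y) - tt              # x FALSE, y TRUE
  ff <- length(x) - tt - tf - ft # x FALSE, y FALSE
  tab = matrix(c(ff, tf, ft, tt), ncol=2)
  n = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(n) = dnames
  dimnames(tab) = n
  as.table(tab)
}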

Let's confirm this is faster and accurate by running it on our data:

stime3 = system.time({
  twotab2 = twoXtwo2(manual, auto, dnames=c("manual", "auto"))
})
print(stime3)
   user  system elapsed
  0.198   0.001   0.200
print(twotab2)
        auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab2)
[1] TRUE

Now, if I were going for speed, this code is good enough for me: it runs about 16.93 times faster than table. The one downside is that it is not as readable as twoXtwo. For even greater speed, I could probably move into C++ using the Rcpp package, but that seems overkill for a two by two table.

Other examples of speeding up the calculation can be found here.

Finishing up

I said I wanted sensitivity/specificity/accuracy/etc. so I will show how to get these. I'm going to use prop.table, which I didn't know about for a while when I first started using R (see margin.table too).

ptab = prop.table(twotab)
rowtab = prop.table(twotab, margin=1)
coltab = prop.table(twotab, margin=2)

As you can see, like the apply command, the prop.table command can either take no margin or take the dimension to divide over (1 for rows, 2 for columns). This means that in ptab, each cell of twotab was divided by the grand total (or sum(tab)). For rowtab, each cell was divided by the rowSums(tab) to get a proportion, and similarly cells in coltab were divided by colSums(tab). After the end of the post, I can show you these are the same.

Getting Performance Measures

Accuracy

Getting the accuracy is very easy:

accur = sum(diag(ptab))
print(accur)
[1] 0.997

Sensitivity/Specificity

For sensitivity/specificity, the “truth” is the rows of the table, so we want the row percentages:

sens = rowtab["TRUE", "TRUE"]
spec = rowtab["FALSE", "FALSE"]
print(sens)
[1] 0.9874
print(spec)
[1] 0.9985

Positive/Negative Predictive Value

We can also get the positive predictive value (PPV) and negative predictive value (NPV) from the column percentages:

ppv = coltab["TRUE", "TRUE"]
npv = coltab["FALSE", "FALSE"]
print(ppv)
[1] 0.9902
print(npv)
[1] 0.9981

Conclusions

After using R for years, I find things to still be very intuitive. Sometimes, though, for large data sets or specific examples, you may want to write your own function for speed, checking against the base functions for a few iterations as a double-check. I have heard this to be a nuisance for those who dislike R, as well as hampering reproducibility in some cases. Overall, I find that someone has made a vetted package that does what you want, but there are still simple cases (such as above) where “rolling your own” is OK and easier than adding a dependency to your code.

Aside: How prop.table works

Over all margins

For just doing prop.table without a margin, you can think of the table being divided by its sum.

print(round(ptab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131
print(round(twotab / sum(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

Over row margins

For the margin=1, or row percentages, you can think of dividing the table by the row sums.

print(round(rowtab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987
print(round(twotab / rowSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

Over column margins

Now for column percentages, you can think of R dividing each cell by its column's sum. This is what prop.table does.

Let's look at the result we should get:

print(round(coltab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

But in R, when you take a matrix and then add/divide/subtract/multiply it by a vector, R does the operation column-wise. So when you take column sums on the table, you get a vector with the same number of columns as the table:

print(colSums(twotab))
  FALSE    TRUE
7956925 1218115

If you try to divide the table by this value, it will not give you the desired result:

print(round( twotab / colSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.990

R operations with matrices and vectors

This is because R thinks of a vector as a column vector (or a matrix of 1 column). R then takes the first element of the first column of the table and divides it by the first column sum (which is correct), but takes the second element of the first column and divides it by the second column sum (which is not correct).
A detailed discussion on SO is located here of how to do row-wise operations on matrices.

Back to column percentages

We can use the t( t() ) operation to get the correct answer:

print(round( t( t(twotab) / colSums(twotab)), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

You can think of R implicitly making the matrix of the correct size with the correct values and then dividing:

cs = colSums(twotab)
cs = matrix(cs, nrow=nrow(tab), ncol=ncol(tab), byrow=TRUE)
print(cs)
        [,1]    [,2]
[1,] 7956925 1218115
[2,] 7956925 1218115
print(round( twotab/cs, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

Happy tabling!


To leave a comment for the author, please follow the link and comment on his blog: A HopStat and Jump Away » Rbloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

What are the names of the school principals in Mexico? If your name is Maria, this post will probably interest you. Trends and cool plots from the national education census of Mexico in 2013

Mon, 2014-07-07 13:28

(This article was first published on Computational Biology Blog in fasta format, and kindly contributed to R-bloggers)

I will start this post with a disclaimer:

The main intention of this post is to show how the first names of school principals in Mexico are distributed, for example, to show basic trends such as the most common nation-wide first name, as well as trends delimited by state and region.

These trends in the data would answer questions such as:

1. Are the most common first names distributed equally among the states?
2. Do the states sharing the same region also share the same "naming" behavior?

Additionally, this post includes cool wordclouds.

Finally, the last part of my disclaimer: I am really concerned about the privacy of the persons involved. I am not in any sense promoting the exploitation of this personal data. If you decide to download the dataset, I would really ask you to study it and to generate information that is beneficial; do not join the Dark side.

Benjamin

##################
# GETTING THE DATASET AND CODE
##################

The database is located here
The R code can be downloaded here
Additional data can be downloaded here

All the results were computed exploring 202,118 schools across the 32 states of Mexico from the 2013 census

##################
# EXPLORING THE DATA
# WITH WORDCLOUDS
##################

Here is the wordcloud of names (by name, I am referring to first name only). It can be concluded that MARIA is by far the most common first name of a school principal in all Mexican schools, followed by JOSE and then by JUAN.

The following wordcloud includes every word in the responsible_name column (this includes first name and last names). Now the plot shows that besides the common first name of MARIA, the last names HERNANDEZ, MARTINEZ and GARCIA are also very common.
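
For anyone curious how such a wordcloud is typically produced, here is a small, hypothetical sketch with the wordcloud package; the name counts below are made up rather than taken from the census:

library(wordcloud)
library(RColorBrewer)
# made-up stand-in for the column of principal first names
first_names <- c(rep("MARIA", 50), rep("JOSE", 20), rep("JUAN", 15),
                 rep("GUADALUPE", 10), rep("FRANCISCO", 8))
name_freq <- sort(table(first_names), decreasing = TRUE)
wordcloud(names(name_freq), as.numeric(name_freq),
          random.order = FALSE, colors = brewer.pal(5, "Dark2"))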



##################
# EXPLORING THE FREQUENCY
# OF FIRST NAMES (TOP 30 | NATION-WIDE)
##################

Looking at this barplot, the name MARIA is by far the most common name among Mexican school principals, with a frequency of ~25,000. The next most popular name is JOSE, with a frequency of ~7,500.


Looking at the same data, adjusted to represent the % of each name within the pool of first names, MARIA occupies ~11% of the name pool.

##################
# HEATMAPS OF THE DATA
##################

 With this heatmap, my intention is to show the distribution of the top 20 most common first names across all the Mexican states



It can be concluded that there is a small cluster of states which have the largest numbers of principals named MARIA (but not so fast! some states, for example Mexico and Distrito Federal, are very populated, so I will reduce this effect in the following plot). In summary, the message of this plot is the distribution of the frequencies of the top 20 most frequent first names across the country.

##################
# CLUSTERS OF THE DATA
##################

For me, a young data-science-padawan, this is my favorite analysis: "hunting down the trends".


The setup of the experiment is very simple: map the top 1,000 most frequent nation-wide names across each state to create a 32 x 1,000 matrix (32 states and the 1,000 most frequent nation-wide names).

With this matrix, normalize the values by dividing each row by its sum (this will minimize the effect of the populated states vs the non-populated ones while maintaining the proportion of the name frequencies per state). Then I just computed a distance matrix and plotted it as a heatmap.
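
A minimal sketch of that normalize-then-cluster step, with a simulated stand-in for the real 32 x 1,000 matrix:

set.seed(1)
# fake state-by-name count matrix: 32 states x 1,000 names
freq_mat <- matrix(rpois(32 * 1000, lambda = 5), nrow = 32,
                   dimnames = list(paste0("state_", 1:32), paste0("name_", 1:1000)))
freq_norm <- prop.table(freq_mat, margin = 1)  # each row (state) now sums to 1
d <- dist(freq_norm)                           # pairwise distances between states
heatmap(as.matrix(d), symm = TRUE)             # the distance matrix plotted as a clustered heatmap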

What I can conclude from this plot is that there are clusters of states that seem to maintain a geographical preference to cluster within their region, which suggests that states sharing the same region are more likely to share "naming" trends due to cultural factors (like the cluster that includes Chihuahua, Sonora and Sinaloa). But this effect is not present in all the clusters.

All images can be downloaded in PDF format here, just don't do evil with them!

Plot 1 here
Plot 2 here
Plot 3 here
Plot 4 here
Plot 5 here
Plot 6 here

Benjamin





To leave a comment for the author, please follow the link and comment on his blog: Computational Biology Blog in fasta format. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

DSC 2014. Day 1

Mon, 2014-07-07 12:15

(This article was first published on JAGS News » R, and kindly contributed to R-bloggers)

Peter Dalgaard, Deepayan Sarkar and Martin Maechler in Brixen

This is a report of the first day of the Directions in Statistical Computing (DSC) conference that took place in Brixen, Italy (See here for an introduction). Performance enhancements were the main theme of the day, covering not just improvements to R itself but alternate language implementations.

Luke Tierney started by presenting the new implementation of reference counting, an experimental feature of the current R development branch. This should reduce unnecessary copies of objects when they are modified, which is an important cause of poor performance. Reference counting replaces the simpler “named” mechanism for determining whether an object needs to be copied when it is modified. It uses the same 2-bit field used by “named”, so counts are limited to the set {0,1,2,3+}, where 3+  is “sticky” and cannot be decremented. However, this is sufficient to stop user-generated replacement functions (i.e. function calls that appear on the left hand side of an assignment) from generating spurious copies.

Radford Neal (University of Toronto) reviewed some of the optimizations to the R interpreter that he has included in Pretty quick R (pqR), his fork of the R-2.15.0 code base. Some of these optimizations are described in more detail in his blog. In particular, deferred evaluation and task merging are also used in the alternative R implementations Riposte and Renjin (see below). Notably, pqR already includes full reference counting.

Thomas Kalibera talked about changes to the R byte code compiler that can improve performance without changing R semantics. He has been working with Luke Tierney to incorporate these changes into R.

Of course, optimization requires performance analysis. Helena Kotthaus (Technical University of Dortmund) presented a suite of R-based facilities for performance analysis, including an instrumented version of R and a set of benchmarks.  They can all be found in the allr github repository.

In addition to these efforts to improve R, there are several alternate implementations written from scratch. Most of these were well represented at the meeting.

  • Michael Haupt presented FastR, a re-implementation of R in Java, built on top of the Truffle interpreter and the Graal byte compiler. FastR is a collaboration between Purdue University, Johannes Kepler University Linz, and Oracle Labs.
  • Alexander Bertram (BeDataDriven) presented Renjin, another open source R interpreter that runs on the JVM.
  • Justin Talbot (Tableau Software) presented Riposte, a fast interpreter and Just-In-Time (JIT) compiler for R written in C++.

Two further R implementations that were not specifically presented at the meeting should also be mentioned here:

  • CXXR is a refactorization of the R code base by Andrew Runnalls (University of Kent). It replaces parts of the R interpreter written in C with C++ code. Andrew Runnalls was not present at the meeting.
  • TIBCO Enterprise Runtime for R (TERR) is a clean room re-implementation of R. This is the only closed source R implementation. Bill Dunlap from TIBCO was an active participant in the meeting but did not give a presentation. However, a presentation on TERR vs R performance concentrating on memory management issues was given by Michael Sannella (TIBCO) at last year’s UseR! conference, and is available here.

Jan Vitek (Purdue University) gave a nice overview of these R implementations and the problems they face. One of the key issues is the lack of formal specification of the R language. In other words, there is no document that formally sets out what is allowed, what is not allowed and what is undefined behaviour. The only way to test whether you have a correct re-implementation is to try to run code that also runs on R. The prize here is to be able to run the 5709 R packages on CRAN without modification, but many of the re-implementations fall short of this goal.

As Robert Gentleman noted in summing up, we now have a “language community” of computer scientists interested in R in addition to the R user and developer communities.

 


To leave a comment for the author, please follow the link and comment on his blog: JAGS News » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

3 ways that functions can improve your R code

Mon, 2014-07-07 09:49

(This article was first published on R for Public Health, and kindly contributed to R-bloggers)


To leave a comment for the author, please follow the link and comment on his blog: R for Public Health. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

How to pick up 3 numbers from a uniform distribution in a transparent manner?

Mon, 2014-07-07 09:38

(This article was first published on Christophe Ladroue » R, and kindly contributed to R-bloggers)

Over in my previous post, I’m giving away 3 copies of my video course on ggplot2 and shiny. To win a copy, you just need to leave a comment and I will select 3 winners among the n participants at random after a deadline.

But how do I pick 3 winners such that:

  • all players are equally likely to win.
  • no-one can contest the fairness of the selection.

The first thing that comes to mind is to run sample(n,3, replace = FALSE) and report the winners. But how do you know I actually ran this code? I could have decided on the winners well in advance and just pretended to run the code.

A way to approach this issue could be to set the random seed to some value so that anyone suspecting foul play can run the code themselves and get the same answer as me:

set.seed(someSeed)
sample(n, 3, replace=FALSE)

I see at least two problems with it: 1) I could still have selected a seed that gives me the sample I eventually want, and 2) even using a function (e.g. of n, the number of participants) as the seed doesn’t guarantee a uniform distribution for each player.

I came up with a plan which I think addresses both the uniform distribution over the players, and the incontestability of the selection.

First, I simplify the problem of selecting 3 winners among n participants to selecting 1 integer from a uniform distribution. This is easy: instead of choosing 3 items among n, I’m selecting 1 of the choose(n,3) possible combinations. Once I’ve sampled 1 number i, I simply use combn(n,3) to generate all the combinations and pick the ith item:
combn(n,3, simplify=FALSE)[[i]].

Second, I have a way to pick a number from a uniform distribution that’s completely transparent. To do this, I pick up a number uniformly at random in a much bigger set (1:N, with N>>choose(n,3)) and project this number back to the interval I’m interested in (1:choose(n,3)). That is, once I have a number j between 1 and N, I use

i <- ceiling(choose(n,3) * j / N)

to find a random value uniformly distributed from 1 to choose(n,3)

Ideally you’d want N to be a multiple of choose(n,3) for every outcome to be exactly equally likely, but if N is much larger than choose(n,3), the slight difference in probability for each outcome is negligible (of the order of 1/N).

Now, how do I pick up a number in a bigger set in a transparent manner? We saw that using the seed is fraught with difficulty, so I need something else. I’m going to use something which neither I nor the players have any control over: the UK national lottery, which is a combination of 7 integers from the set {1,…,49}. More precisely, I’m doing this:

  • declare in advance which future lottery draw I’m using.
  • use the lexicographic index of the combination drawn as my number j; j comes from a uniform distribution between 1 and N=choose(49,7), which is a relatively big number.

And this is it: this way, I pick 3 winners among n and there is no way for me (or the players) to rig the selection. Here is the code I'll run:

lotteryResult <- c( , , , , , , ) # to be filled in by the actual lottery draw
nPlayers <- 200                   # to be updated with the number of participants
nWinners <- 3
index   <- ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49, 7))
winners <- combn(nPlayers, nWinners, simplify = FALSE)[[index]]
cat("\n The winners are:", winners)

The deadline for the competition is Wednesday 09th July at midnight UK time. The next lottery (UK lotto) draw after that is on Saturday 12th July, and I’ll use that draw to decide on the winners, using the code I’ve presented here.

What do you think? Can you poke holes in my strategy? Can you come up with something simpler?

Note about the lexicographic index
It is not terribly difficult to find the index of a combination without generating them all. All you need to do is count the number of combinations that come before it. For example, if the combination starts with 3, you know it comes after all the combinations that start with 1 and 2. Here is the code I wrote to go from a combination to its lexicographic index. There's also a test function after it.

# gives the index of the combination if sorted in lexicographic order, starting at 1
# lexicographicIndex(c(2,3,4), 5) = 7
# because [2,3,4] is the 7th item in the lexicographically ordered combinations of length 3 using 5 letters:
# 1 2 3
# 1 2 4
# 1 2 5
# 1 3 4
# 1 3 5
# 1 4 5
# 2 3 4
# 2 3 5
# 2 4 5
# 3 4 5
# C. Ladroue
# combination is a sequence of unique integers between 1 and alphabetSize

lexicographicIndex <- function(combination, alphabetSize){
  combination <- sort(combination)

  combinationLength <- length(combination)

  index <- 1
  for(p in 1:combinationLength){

    starting  <- ifelse(p == 1, 1, combination[p - 1] + 1)
    finishing <- combination[p] - 1

    if(starting <= finishing)
      index <- index + sum(sapply(starting:finishing, function(j) choose(alphabetSize - j, combinationLength - p)))
  }
  index
}

lexicographicIndexTest <- function(){
  alphabetSize <- 10
  combinationLength <- 3
  x <- combn(alphabetSize, combinationLength, simplify = FALSE)
  cat("\n test all combinations with alphabet size = ", alphabetSize, "and combination length = ", combinationLength, ": ",
      all(sapply(1:length(x), function(index) lexicographicIndex(x[[index]], alphabetSize) == index))
  )
  cat("\n")
}
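
A quick way to exercise these functions is to reproduce the example from the comments and then run the test:

lexicographicIndex(c(2, 3, 4), 5)  # should return 7, as in the comment above
lexicographicIndexTest()           # prints TRUE if the index agrees with combn's ordering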

To leave a comment for the author, please follow the link and comment on his blog: Christophe Ladroue » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Introduction to R for Life Scientists: Course Materials

Mon, 2014-07-07 08:57

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

Last week I taught a three-hour introduction to R workshop for life scientists at UVA's Health Sciences Library.

I broke the workshop into three sections:

In the first half hour or so I presented slides giving an overview of R and why R is so awesome. During this session I emphasized reproducible research and gave a demonstration of using knitr + rmarkdown in RStudio to produce a PDF that can easily be recompiled when the data are updated.
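
For anyone who hasn't seen that workflow, a minimal sketch (the file name here is a placeholder) is to keep the analysis in an R Markdown file and re-render it whenever the data change:

# analysis.Rmd is a hypothetical R Markdown document whose code chunks read the data
# (PDF output additionally requires a LaTeX installation)
library(rmarkdown)
render("analysis.Rmd", output_format = "pdf_document")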

In the second (longest) section, participants had their laptops out with RStudio open, coding along with me as I gave an introduction to R data types, functions, getting help, data frames, subsetting, and plotting. Participants were challenged with an exercise requiring them to create a scatter plot using a subset of the built-in mtcars dataset.
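
One possible solution in that spirit (a sketch only; the exact exercise prompt isn't reproduced in this post) subsets mtcars and plots two of its columns with base graphics:

eff <- subset(mtcars, cyl == 4)          # keep only the 4-cylinder cars
plot(eff$wt, eff$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "4-cylinder cars in mtcars")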

We concluded with an analysis of RNA-seq data using the DESeq2 package. We started with a count matrix and a metadata file (the modENCODE pasilla knockout data packaged with DESeq2), imported the data into a DESeqDataSet object, ran the DESeq pipeline, extracted results, and did some basic visualization (MA-plots, PCA, volcano plots, etc). A future day-long course will cover RNA-seq in more detail (intro UNIX, alignment, & quantitation in the morning; intro R, QC, and differential expression analysis in the afternoon).
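
The core of that workflow looks roughly like the following sketch, assuming a count matrix countdata, a sample metadata data frame coldata, and a condition column in the metadata (all of these names are placeholders):

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = countdata,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)                      # run the full DESeq pipeline
res <- results(dds)                    # extract the results table
plotMA(res)                            # MA-plot
rld <- rlog(dds)                       # regularized-log transform for visualization
plotPCA(rld, intgroup = "condition")   # PCA of the samples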

I wrote the course materials with knitr, rendered them with Jekyll, and hosted them as a GitHub project page. The rendered course materials can be found at the link below, and the source is on GitHub.

Course Materials: Introduction to R for Life Scientists

Slides:

Cheat Sheet:

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Four Simple Turtle Graphs To Play With Kids

Mon, 2014-07-07 03:00

(This article was first published on Ripples, and kindly contributed to R-bloggers)

Technology is just a tool: in terms of getting the kids working together and motivating them, the teacher is the most important (Bill Gates)

Some days ago I read this article on R-bloggers and discovered the TurtleGraphics package. I have known about turtle graphics for a long time and had been thinking of writing a post about them, doing some drawings of my own with this technique. In fact, some of my previous drawings could have been done using this package.

Turtle graphics are a good tool to teach kids some very important mathematical and computational concepts such as length, coordinates, location, angles and iteration. Small changes in some of these parameters (especially the angles) can produce very different drawings. Those who want to know more about turtle graphics can read the Wikipedia article.

Below you can find four simple graphics, each preceded by the code that generates it. Please send me your creations if you like:

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
for (i in 1:150) {
  turtle_forward(dist = 1 + 0.5 * i)
  turtle_right(angle = 89.5)
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_right(angle = 234)
for (i in 1:100) {
  turtle_forward(dist = 0.9 * i)
  turtle_right(angle = 144.3)
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(48, 36)
d <- 50
for (i in 1:300) {
  turtle_forward(dist = d)
  if (i %% 4 == 0) {
    turtle_right(angle = 75)
    d <- d * 0.95
  } else {
    turtle_right(angle = 90)
  }
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(50, 35)
turtle_right(angle = 30)
d <- 25
turtle_setpos(50 - d/2, 50 - d/2 * tan(pi/6))
for (i in 1:100) {
  turtle_forward(dist = d)
  d <- d + 0.5
  turtle_right(angle = 120 + 1)
}
turtle_hide()


To leave a comment for the author, please follow the link and comment on his blog: Ripples. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs