R-bloggers
Buster – a new R package for bagging hierarchical clustering
(This article was first published on Drunks&Lampposts » R, and kindly contributed to R-bloggers)
I recently found myself a bit stuck. I needed to cluster some data. The distances between the data points were not representable in Euclidean space so I had to use hierarchical clustering. But then I wanted stable clusters that would retain their shape as I updated the data set with new observations. This I could do using fuzzy clustering but that (to my knowledge) is only available for clustering techniques that operate in Euclidean space, for example k-means clustering, not for hierarchical clustering.
It’s not a typical everyday human dilemma. It needs a bit more explanation.
Some background
Clustering (assuming everyone is happy with this technique but if not click here) typically works on a matrix of distances between data points. We sometimes refer to the distances as dissimilarities – the greater the distance the more dissimilar the data points. So in a simple case the data points might be customers and the distances reflect how different the customers are in terms of demographics, purchasing behaviour etc.
A simple way to achieve this would be to plot your data points against a set of axes representing the things you would like to include in the dissimilarity measure and then just measure the distance between the points. Scaling the data would ensure that none of the attributes are prioritised over the others. For example plot your customers by age and income (both standardised) and then measure the distance between them to determine how similar they are in age and income.
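To make this concrete, here is a small hedged sketch in base R (the customer values are made up purely for illustration):

```r
# hypothetical customer data: age and income
customers <- data.frame(
  age    = c(25, 40, 58),
  income = c(30000, 52000, 47000)
)

# standardise each column so neither attribute dominates the distance
scaled <- scale(customers)

# pairwise Euclidean distances between the (scaled) customers
d <- dist(scaled, method = "euclidean")
print(d)
```

Because income is on a much larger scale than age, skipping the `scale()` step would make the distances almost entirely about income.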
This is Euclidean distance, and it is implicit in many popular and powerful clustering algorithms, for example k-means and its variants.
But how do you know if your measure of dissimilarity is representable as a Euclidean distance and therefore amenable to k-means? It’s simple enough if you started with some variables and derived your Euclidean distance, but it doesn’t always work this way. For example, suppose I am measuring similarity by correlation. Is the absolute value of correlation representable as a Euclidean distance? Is there some n-dimensional space where we could plot our data points such that the distance between them represented the absolute value of their correlation?
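As an illustrative sketch of that scenario (on simulated data): one minus the absolute correlation is a common dissimilarity, and hierarchical clustering will happily consume it whether or not it happens to be Euclidean:

```r
set.seed(1)
# five series (rows) of simulated observations
m <- matrix(rnorm(50), nrow = 5)

# dissimilarity: 1 - |correlation| between the rows
d <- as.dist(1 - abs(cor(t(m))))

# hierarchical clustering accepts any dissimilarity matrix
hc <- hclust(d, method = "average")
print(hc$merge)
```

Note that `hclust()` never asks whether `d` is Euclidean, which is exactly why hierarchical clustering is the fallback when it isn't.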
A quick check to see if your measure is ever going to be representable as a Euclidean distance is: does it satisfy the triangle inequality?

d(x, z) ≤ d(x, y) + d(y, z)

where d(x, y) is the distance between x and y.
For example, love and hate (assuming they are quantifiable) do not satisfy the triangle inequality. If x loves y and y loves z, this places no constraint on the degree to which x loves z. It could easily be more than the sum of the love of x for y and y for z!
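A brute-force version of this check can be sketched in base R (the helper name and the floating-point tolerance are my own):

```r
# returns TRUE if dissimilarity matrix D satisfies the triangle inequality
satisfies_triangle <- function(D) {
  n <- nrow(D)
  for (x in seq_len(n)) for (y in seq_len(n)) for (z in seq_len(n)) {
    # small tolerance guards against floating-point noise
    if (D[x, z] > D[x, y] + D[y, z] + 1e-12) return(FALSE)
  }
  TRUE
}

# Euclidean distances always pass
D <- as.matrix(dist(matrix(rnorm(20), ncol = 2)))
satisfies_triangle(D)  # TRUE
```

Passing this check does not prove the measure is Euclidean (that needs more, e.g. a positive semi-definite Gram matrix), but failing it settles the question immediately.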
Continue reading at my new professional blog coppelia.io

To leave a comment for the author, please follow the link and comment on his blog: Drunks&Lampposts » R. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Recording of OpenCPU talk at #useR2014
(This article was first published on OpenCPU, and kindly contributed to Rbloggers)
A recording of the useR! 2014 presentation about OpenCPU is now available on YouTube. This talk gives a brief (20 minute) motivation and introduction to some of the high-level concepts of the OpenCPU system. The video contains mostly screen recording, mixed with some AV footage provided by Timothy Phan (thanks!).
Can Rcpp fuse ?
(This article was first published on R Enthusiast and R/C++ hero, and kindly contributed to Rbloggers)
One of the features of Rcpp11 people seemed to like during useR is the fuse function. fuse is somewhat similar to the c function in R.
For the purpose of this post, let's simplify what fuse does, and just say that it takes several compatible vectors and fuses them together into one.
// some vectors
NumericVector a = {1.0, 2.3}, b = {2.5, 5.7}, c = {4.2, 4.1, 1.4};
// fuse them into a vector of size 7
NumericVector d = fuse(a, b, c);

One nice thing is that it also handles scalars, e.g.:
// some vectors
NumericVector a = {1.0, 2.3}, c = {4.2, 4.1, 1.4};
// fuse them into a vector of size 6
NumericVector d = fuse(a, 3.5, c);

So people seem to like that, and I don't blame them, that's a cool feature. Then usually what happens is that I get:
That's a cool feature you have in Rcpp11. Can I use it in Rcpp too ?
I have mixed feelings about that kind of question. It is nice that people acknowledge that this is a nice feature and want to use it, but at the same time I'd like people to stop thinking that Rcpp11 is some kind of lab for things that will later go into Rcpp. Some might, but it is unlikely that I will participate in that effort. I'd rather help people migrate to Rcpp11.
Conceptually, fuse is really simple. For each argument, it figures out if this is a scalar or a vector, retrieves the size of the vector to create, creates it, and then loops around to copy data from the inputs to the output. So fuse is just a glorified set of for loops.
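To make the idea concrete, here is a toy R sketch of that measure-allocate-copy logic. This is not the actual Rcpp11 implementation (which is C++), just the same idea expressed in R; the function name is mine:

```r
# toy sketch of fuse's logic: measure the inputs, allocate once, then copy
fuse_sketch <- function(...) {
  inputs <- list(...)                      # scalars and vectors alike
  total  <- sum(vapply(inputs, length, integer(1)))
  out    <- numeric(total)                 # allocate the output once
  pos    <- 1L
  for (v in inputs) {                      # copy each input in turn
    out[pos:(pos + length(v) - 1L)] <- v
    pos <- pos + length(v)
  }
  out
}

fuse_sketch(c(1.0, 2.3), 3.5, c(4.2, 4.1, 1.4))
# same result as c(1.0, 2.3, 3.5, 4.2, 4.1, 1.4)
```

In R this is of course just `c()`; the point of fuse is doing the same thing at C++ speed on Rcpp11 vectors.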
In Rcpp11 because we can assume C++11, this can be written using variadic templates, i.e. template functions taking a variable number of arguments. Rcpp has to remain compatible with C++98, so if we wanted to port fuse to Rcpp, we would have to ship a C++98 compatible version. One overload for 2 parameters, one for 3 parameters, one for 4 parameters, ...
This brings at least two problems to the table:
Code bloat. This would be one of these features that add thousands of lines of code. This is hard to maintain and slows the compiler down yet again.
Where to stop? NumericVector::create arbitrarily stops at 20 parameters. Of course, fusing 20 vectors is going to be enough for most use cases, but on occasion people will feel the desire to go beyond, and the compiler will welcome them with lovely multi-page errors.
Are Consumer Preferences Deep or Shallow?
(This article was first published on Engaging Market Research, and kindly contributed to Rbloggers)
John Hauser, because no one questions his expertise, is an excellent spokesperson for the viewpoint that consumer preferences are real, as presented in his article "Self-Reflection and Articulated Consumer Preferences." Simply stated, preferences are enduring when formed over time and after careful consideration of actual products. As a consequence, accurate measurement requires us to encourage self-reflection within realistic contexts. "Here true preferences mean the preferences consumers use to make decisions after a serious evaluation of the products that are available on the market." However, serious evaluation takes some time and effort, in fact a series of separate online tasks including revealed preference plus self-reports of both attribute-level preferences and decision-making strategies. We end up with a lot of data from each respondent, enabling the estimation of a number of statistical models (e.g., a hierarchical Bayes choice-based conjoint that could be fit using the bayesm R package). All this data is deemed necessary in order for individuals to learn their "true" preferences. Underlying Hauser's approach is a sense of inevitability: that a decision maker will arrive at the same resolution regardless of their path, as long as they begin with self-reflection.
A more constructivist alternative can be found in my post on "The Mind is Flat!" where it is argued that we lack the cognitive machinery to generate, store and retrieve the extensive array of enduring preferences demanded by utility theory. Although we might remember our previous choice and simply repeat it as a heuristic simplification strategy, working our way through the choice processes anew will likely result in a different set of preferences. Borrowing a phrase from Stephen Jay Gould, replaying the "purchase process tape" will not yield the same outcome unless there are substantial situational constraints forcing the same resolution.
Do preferences control information search, or are preferences evoked by the context? Why would we not expect decision making to be adaptive and responsive to the situation? Enduring preferences may be too rigid for our constantly changing marketplaces. Serendipity has its advantages. After the fact, it is easy to believe that whatever happened had to be. Consider the case study from Hauser's article, and ask what if there had not been an Audi dealership near Maria? Might she have been just as satisfied or perhaps even happier with her second choice? It all works out for the best because we are inventive storytellers and cognitive dissonance will have its way. Isn't this the lesson from choice blindness?
Still, most of marketing research continues to believe in true and enduring preferences that can be articulated by the reflective consumer, even when confronted by overwhelming evidence that the human brain is simply not capable of such feats. We recognize patterns, even when they are not there, and we have extensive episodic memory for stuff that we have experienced. We remember faces and places, odors and tastes, and almost every tune we have ever heard, but we are not as proficient when it comes to PINs and passwords or dates or even simple calculations. Purchases are tasks that are solved not by looking inside for deep and enduring preferences. Instead, we exploit the situation or task structure and engage in fast thinking with whatever preferences are elicited by the specifics of the setting. Consequently, preferences are shallow and contextual.
As long as pre-existing preferences were in control, we were free to present as many alternatives and feature-levels as we wished. The top-down process would search for what it preferred and the rest would be ignored. However, as noted above, context does matter in human judgment and choice. Instead of deciding what you feel like eating (top down), you look at the menu and see what looks good to you (bottom up). Optimal experimental designs that systematically manipulate every possible attribute must be replaced by attempts to mimic the purchase context as closely as possible, not just the checkout but the entire consumer decision journey. Purchase remains the outcome of primary interest, but along the way attention becomes the dependent variable, for "a wealth of information creates a poverty of attention" (Herbert A. Simon).
Future data collection will have us following consumers around real or replicated marketplaces and noting what information was accessed and what was done. Our statistical models will then be forced to deal with the sparsity resulting from consumers who concentrate their efforts on only a very few of the many touchpoints available to them. My earlier post on identifying the pathways in the consumer decision journey will provide some idea of what such an analysis might look like. In particular, I show how the R package NMF is able to uncover the underlying structure when the data matrix is sparse. More will follow in subsequent posts.
Speed Tests for Rolling/Running Functions
(This article was first published on Timely Portfolio, and kindly contributed to Rbloggers)
I use rolling and running functions almost daily with financial time series. In my post A Whole New World with Chains and Pipes, I made this statement: "I have noticed that rolling analysis with xts can sometimes be slow. as.matrix is my favorite way to speed things up, since I usually do not need xts' powerful indexing and subsetting features." I felt like I should be a little more thorough, so I put…
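As an illustrative (and entirely unofficial) sketch of the kind of comparison involved, here is a base-R speed test of a naive rolling mean against a vectorised cumsum-based one. This is not the author's benchmark, just the general shape of such a test:

```r
set.seed(42)
x <- rnorm(1e4)
w <- 50  # window width

# naive rolling mean: one mean() call per window
roll_naive <- function(x, w) {
  sapply(seq_len(length(x) - w + 1), function(i) mean(x[i:(i + w - 1)]))
}

# vectorised rolling mean via cumulative sums
roll_fast <- function(x, w) {
  cs <- cumsum(c(0, x))
  (cs[(w + 1):length(cs)] - cs[1:(length(cs) - w)]) / w
}

# both give the same answer; the cumsum version avoids the per-window overhead
all.equal(roll_naive(x, w), roll_fast(x, w))
system.time(roll_naive(x, w))
system.time(roll_fast(x, w))
```

The same principle is behind dropping down to a plain matrix from xts: the less per-element machinery each window touches, the faster the rolling computation runs.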
R Day at Strata NYC
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
RStudio will teach the new essentials for doing data science in R at this year’s Strata NYC conference, Oct 15, 2014.
R Day at Strata is a full day of tutorials that will cover some of the most useful topics in R. You’ll learn how to manipulate and visualize data with R, as well as how to write reproducible, interactive reports that foster collaboration. Topics include:
9:00am – 10:30am
A Grammar of Data Manipulation with dplyr
Speaker: Hadley Wickham
11:00am – 12:30pm
A Reactive Grammar of Graphics with ggvis
Speaker: Winston Chang
1:30pm – 3:00pm
Analytic Web Applications with Shiny
Speaker: Garrett Grolemund
3:30pm – 5:00pm
Reproducible R Reports with Packrat and R Markdown
Speaker: JJ Allaire & Yihui Xie
The tutorials are integrated into a cohesive day of instruction. Many of the tools that we’ll cover did not exist six months ago, so you are almost certain to learn something new. You will get the most out of the day if you already know how to load data into R and have some basic experience visualizing and manipulating data.
Visit strataconf.com/stratany2014 to learn more and register! Early bird pricing ends July 31.
Not available on October 15? Check out Hadley’s Advanced R Workshop in New York City on September 8 and 9, 2014.
DSC 2014, Day 2
(This article was first published on JAGS News » R, and kindly contributed to Rbloggers)
This will be a somewhat shorter summary of the second day of DSC 2014. There was a wider variety of presentations today, but I am only going to discuss those that touched on the main theme of the day, which was reproducibility.
Subscribers to the R-devel mailing list will recall a long discussion started in March by Jeroen Ooms called "A case for freezing CRAN". Jeroen described the difficulties of reproducing previous analyses (either those done by other people or done by oneself at a previous time) due to changes introduced in R packages. Jeroen proposed that a frozen version of CRAN be made available for each R release, so that all users with the same version of R would have access to the same set of packages. Several presentations at DSC directly addressed this challenge.
Simon Urbanek (AT&T) presented RCloud, an environment for collaboratively sharing R scripts. Data and code are pushed into the cloud so that the user does not care where it is stored. All user work is contained in a “notebook” which is automatically under version control. Notebooks can be shared by simply sharing a URL with another RCloud user.
JJ Allaire presented packrat, a dependency management system for R packages that essentially creates a separate library for each project. Packrat creates isolated, portable and reproducible environments for R packages, and does not require any changes to CRAN.
Michael Lawrence presented GRAN, a package management system, but my notes don’t allow me to say much about it here and the slides are not on the web site to jog my memory.
David Smith from Revolution Analytics presented the Reproducible R Toolkit (RRT) and its server-side counterpart the Marmoset R Archive Network (MRAN). This is the solution that most closely resembles Jeroen’s frozen CRAN. David suggested that it would be more useful to ordinary R users, as opposed to R developers who may prefer the approach taken by packrat or GRAN.
Special mention must go to Romain Francois, who drove through the night from the Rencontres R conference in Montpellier to Brixen to give a talk about Rcpp11, then went straight on to Paris to fly out to the useR! conference in California. I told Romain he was crazy.
Overall this was an excellent meeting. It was very important to get together people who had been communicating by email only for a long period and I was impressed at the increasing investment in R by companies.
Dependencies of popular R packages
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
With the growing popularity of R, there is an associated increase in the popularity of online forums to ask questions. One of the most popular sites is StackOverflow, where more than 60 thousand questions have been asked that are tagged as related to R.
On the same page, you can also find related tags. Among the top 15 tags associated with R, several are also packages you can find on CRAN:
 ggplot2
 data.table
 plyr
 knitr
 shiny
 xts
 lattice
It is very easy to install these packages directly from CRAN using the R function install.packages(), but this will also install all of these packages' dependencies.
This leads to the question: How can one determine all these dependencies?
It is possible to do this using the function available.packages() and then query the resulting object.
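For instance, a hedged base-R sketch of that approach, using a tiny hand-made stand-in for the matrix that available.packages() returns (the dependency values below are illustrative, not the packages' real dependency lists, and the helper name is mine):

```r
# a tiny hand-made stand-in for the matrix available.packages() returns
pkgdata <- rbind(
  ggplot2 = c(Depends = "R (>= 2.14), methods", Imports = "plyr, digest"),
  plyr    = c(Depends = "R (>= 2.11.0)",        Imports = "Rcpp")
)

# split a comma-separated field and strip version requirements like "(>= 2.14)"
deps_of <- function(pkg, field = "Imports") {
  raw <- strsplit(pkgdata[pkg, field], ",")[[1]]
  trimws(sub("\\s*\\(.*\\)", "", raw))
}

deps_of("ggplot2")  # "plyr" "digest"
```

With the real available.packages() matrix this kind of string-munging gets tedious fast, which is the gap miniCRAN fills.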
But it is easier to answer this question using the functions in a new package, called miniCRAN, that I am working on. I have designed miniCRAN to allow you to create a mini version of CRAN behind a corporate firewall. You can use some of the functions in miniCRAN to list packages and their dependencies, in particular:
 pkgAvail()
 pkgDep()
 makeDepGraph()
I illustrate these functions in the following scripts.
Start by loading miniCRAN and retrieving the available packages on CRAN. Use the function pkgAvail() to do this:
library(miniCRAN)
pkgdata <- pkgAvail(repos = c(CRAN = "http://cran.revolutionanalytics.com"), type = "source")
head(pkgdata[, c("Depends", "Suggests")])

##                                              Depends              Suggests
## A3                  "R (>= 2.15.0), xtable, pbapply" "randomForest, e1071"
## abc              "R (>= 2.10), nnet, quantreg, MASS"                    NA
## abcdeFBA    "Rglpk,rgl,corrplot,lattice,R (>= 2.10)"           "LIM,sybil"
## ABCExtremes              "SpatialExtremes, combinat"                    NA
## ABCoptim                                          NA                    NA
## ABCp2                                        "MASS"                     NA
Next, use the function pkgDep() to get dependencies of the 7 popular tags on StackOverflow:
tags <- c("ggplot2", "data.table", "plyr", "knitr", "shiny", "xts", "lattice")
pkgList <- pkgDep(tags, availPkgs = pkgdata, suggests = TRUE)
pkgList

##  [1] "abind"        "bit64"        "bitops"       "Cairo"
##  [5] "caTools"      "chron"        "codetools"    "colorspace"
##  [9] "data.table"   "dichromat"    "digest"       "evaluate"
## [13] "fastmatch"    "foreach"      "formatR"      "fts"
## [17] "ggplot2"      "gtable"       "hexbin"       "highr"
## [21] "Hmisc"        "htmltools"    "httpuv"       "iterators"
## [25] "itertools"    "its"          "KernSmooth"   "knitr"
## [29] "labeling"     "lattice"      "mapproj"      "maps"
## [33] "maptools"     "markdown"     "MASS"         "mgcv"
## [37] "mime"         "multcomp"     "munsell"      "nlme"
## [41] "plyr"         "proto"        "quantreg"     "RColorBrewer"
## [45] "Rcpp"         "RCurl"        "reshape"      "reshape2"
## [49] "rgl"          "RJSONIO"      "scales"       "shiny"
## [53] "stringr"      "testit"       "testthat"     "timeDate"
## [57] "timeSeries"   "tis"          "tseries"      "XML"
## [61] "xtable"       "xts"          "zoo"
Wow, these 7 packages expand to a list of 63 packages in total (including themselves)!
You can graphically visualise these dependencies in a graph, by using the function makeDepGraph():
p <- makeDepGraph(pkgList, availPkgs = pkgdata)
library(igraph)
plotColours <- c("grey80", "orange")
topLevel <- as.numeric(V(p)$name %in% tags)
par(mai = rep(0.25, 4))
set.seed(50)
vColor <- plotColours[1 + topLevel]
plot(p, vertex.size = 8, edge.arrow.size = 0.5, vertex.label.cex = 0.7,
     vertex.label.color = "black", vertex.color = vColor)
legend(x = 0.9, y = 0.9, legend = c("Dependencies", "Initial list"),
       col = c(plotColours, NA), pch = 19, cex = 0.9)
text(0.9, 0.75, expression(xts %>% zoo), adj = 0, cex = 0.9)
text(0.9, 0.8, "xts depends on zoo", adj = 0, cex = 0.9)
title("Package dependency graph")

So, if you wanted to install the 7 most popular R packages (according to StackOverflow), R will in fact download and install up to 63 different packages!
meteoForecast, a package to obtain NWP-WRF forecasts in R
(This article was first published on Omnia sunt Communia! » Renglish, and kindly contributed to Rbloggers)
The Weather Research and Forecasting (WRF) Model is a numerical weather prediction (NWP) system. NWP refers to the simulation and prediction of the atmosphere with a computer model, and WRF is a set of software for this.
meteoForecast is a new R package that implements functions to download data from the Meteogalicia and OpenMeteo NWP-WRF services using the NetCDF Subset Service.
Read this introduction for additional information and examples.
googleVis 0.5.3 released
(This article was first published on mages' blog, and kindly contributed to Rbloggers)
Recently we released googleVis 0.5.3 on CRAN. The package provides an interface between R and Google Charts, allowing you to create interactive web charts from R.

[Screenshot of some of the Google Charts]
Although this is mainly a maintenance release, I'd like to point out two changes and two bug fixes:
 Default chart width is set to 'automatic' instead of 500 pixels.
 Intervals for column roles have to end with the suffix ".i", with "i" being an integer. Several interval columns are allowed, see the roles demo and vignette for more details.
 The order of yvariables in core charts wasn't maintained. Thanks to John Taveras for reporting this bug.
 Width and height of googleVis charts were only accepted in pixels, although the Google Charts API uses standard HTML units (for example, '100px', '80em', '60', 'automatic'). If no units are specified the number is assumed to be pixels. This has been fixed. Thanks to Paul Murrell for reporting this issue.
analyze the demographic and health surveys (dhs) with r
(This article was first published on asdfree by anthony damico, and kindly contributed to Rbloggers)
professors of public health 101 probably cite the results of the demographic and health surveys (dhs) more than all other data sources combined. funded by the united states agency for international development (usaid) and administered by the technically-savvy analysts at icf international, this collection of multinational surveys enters its third decade as the authoritative source of international development indicators. want a sampler of what that all means? load up the dhs homepage and watch the statistics fly by: 70% of beninese children younger than five sleep under an insecticide-treated bednet / more than a third of kyrgyz kids aged 6-59 months have anemia / only 35% of guinean households have a place to wash yer hands. this is the front-and-center toolkit for professional epidemiologists who want to know who/what/when/where/why to target a public health intervention in any of these nations. before you read any more about the microdata, look at http://www.statcompiler.com/. this online table creator might give you access to every statistic that you need, and without the fuss, muss, or missing values of a person-level table. (bonus: click here to watch me describe dhs-style online table creation from a teleprompter.) why should you use statcompiler? because it's quick, easy, and has aggregated statistics for every country at your fingertips.
if that doesn't dissuade you from digging into an actual data set, one more point of order: you'll likely only be given access to a small number of countries. so when applying for access, it'd be smart to ask for whichever country you are interested in _and also_ for malawi 2004. that way, you will be able to muck around with my example syntax using the data tables that they were intended for. if you have already registered, no fear: you can request that malawi be added to your existing project. i tried requesting every data set. i failed. the data archivists do not grant access to more than a country or two, so choose the countries you need sparingly. while semi-restricted, this is public data: so long as you have a legitimate research question, you'll be granted access without cost. this new github repository contains three scripts:
download and import.R
 pretend you're a real boy and log into dhsprogram.com's data download service
 detect which countries, years, and survey data sets you've been granted access to
 download, import, save each of those files onto your local computer
analysis examples.R
 load the 2004 malawi individual recodes file into working memory
 construct the complex sample survey object
 run a battery of stats, standard errors, subsets, etc. that recreate some rows of pdf page 324 of this publication
replication.R
 load the 2004 malawi individual recodes file into working memory
 recreate some of the old school-style strata described in this forum
 match a single row from pdf page 324 all the way across, deft and all.
click here to view these three scripts
for more detail about the demographic and health surveys (dhs), visit:
 the hand calculations for rates, risks, and other epidemiological measures
 same as above, but in printable pdf form
 the icfi landing page for dhs, perhaps you'll prefer their summary to mine
notes:
next to the main survey microdata set, you'll see some roman numerals ranging from one through six. this number indicates which version of the survey manual that particular dataset corresponds to. different versions have different questions, structures, and microdata files: read the entire "general description" section (only about ten pages) of the manual before you even file your request for data access.
these microdata are complex, confusing, occasionally strangely-coded, and often difficult to reconcile with historical versions. (century month codes? wowza.) that's understandable, and the survey administrators deserve praise for keeping everything as coherent as they have after thirty years, six major questionnaire revisions, and ninety countries of non-english-speaking respondents across this crazy planet of ours. if you claw through the documentation and cannot find an explanation, you'll want to engage the user forum. they are thoroughly responsive, impressively knowledgeable, and will help you get to the bottom of it, whatever `it` may be. before you ask a question here, or really anywhere in life, have a solid answer to what-have-you-tried. and for heavens' sakes,* prepare a reproducible example for them.
* my nondenominational way of saying heaven's sake.
confidential to sas, spss, stata, and sudaan users: i would shake your hand but you've yet to adopt the statistical equivalent of coughing into your sleeve. time to transition to r. :D
Reflections on useR! 2014
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
UseR! 2014, the R user conference held last week in LA, was the most successful yet. Around 700 R users from around the world converged on the UCLA campus to share their experiences with the R language and to socialize with other data scientists, statisticians and others using R.
The week began with a series of 3-hour tutorials on topics as diverse as data management, visualization, statistics and biostatistics, programming, and interactive applications. (Joe Rickert reported on the tutorials from the field last week.) Unlike in previous years, the tutorial day was included in the registration, which meant that all of the sessions were jam-packed with interested R users.
The remaining two-and-a-half days were packed with keynotes, contributed talks, and social sessions galore. Given the parallel nature of the tracks I couldn't make it to more than a fraction of the talks, but here are a few of the highlights from my notes:
 In the opening keynote, John Chambers shared the story of the genesis of the S language (which later begat R — see more in this 2013 interview). The three key principles behind the S language were objects (everything is an object), functions (everything that happens is a function call), and interfaces (the language is an interface to other algorithms). In fact, the very first sketch of the language (from a slide made in 1976) described it as an "Algorithm Interface".
 Jeroen Ooms demonstrated OpenCPU, a Web API for scientific computing. OpenCPU makes it possible to integrate R into web-based apps for non-R users. You can see some examples of OpenCPU in action in the App Gallery, including the Stockplot app that Jeroen demonstrated.
 Karthik Ram talked about the ROpenSci project and fostering open science with R. The ROpenSci team has created dozens of R packages, including interfaces to public data sources, data visualization tools, and support for reproducible research.
 Google has more than 1000 R users on its internal R mailing list, according to Tim Hesterberg in Google's sponsor talk. (My sponsor talk for Revolution Analytics can be found here.)
 The heR panel discussion and mixer, which facilitated an excellent conversation about women in data science and the R community. (The useR! conference itself was around 25% women — certainly room for improvement, but better than many math or computer science conferences.)
 Thomas Fuchs from NASA/JPL, who revealed in two talks that R is used for vision analysis in space exploration (including the Mars HiRISE mission and deep-space astronomy). As a NASA buff this was a thrill for me to learn, and I hope to write more about it sometime.
 The conference banquet under the summer twilight on the UCLA lawns, featuring entertaining anecdotes from David McArthur.
For more on the conference, the slides from many of the talks and tutorials are available at the useR! website. (If you presented there, please submit your slides via a pull request.) Also check out these reviews of the conference from Daniel Gutierrez at InsideBigData and Phyllis Zimbler Miller.
The next conference, useR! 2015, will be held in Aalborg, Denmark. I'm looking forward to it already!
To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Chillin’ at UseR! 2014
(This article was first published on Publishable Stuff, and kindly contributed to Rbloggers)
This year’s UseR! conference was held at the University of California in Los Angeles. Despite the great weather and a nearby beach, most of the conference was spent in front of projector screens in 18°C (64°F) rooms, because there were so many interesting presentations and tutorials going on. I was lucky to present my R package Bayesian First Aid, and the slides can be found here:
There was so much great stuff going on at UseR! and here follows a random sample:

John Chambers on Interfaces, Efficiency and Big Data. One of the creators of S (the predecessor of R) talked about the history of R and exciting new developments such as Rcpp11. He was also kind enough to sign my copy of S: An Interactive Environment for Data Analysis and Graphics, the original S book from 1984 :)

Yihui Xie the Knitr Ninja. Yihui gave the most amazing presentation about how to be a knitr ninja, using only an R script and sound effects. The “anime sword” sound effect used by Yihui is now available in the development version of beepr and can be played by running beep("sword").

Romain François held both a tutorial and a presentation on the Rcpp11 package, a most convenient way of connecting R and C++.

Dirk Eddelbuettel held a keynote on the topic of R, C++ and Rcpp, another convenient way of connecting R and C++. Do we see a theme here? He also talked about Docker, which I had never heard of before: it provides a sort of lightweight virtual machine that can be easily built and distributed (this is my interpretation, which might be a bit off).

RStudio was otherwise running the show, with great presentations by Winston Chang on ggvis, Joe Cheng on Shiny, J.J. Allaire and Kevin Ushey on Packrat (A Dependency Management System for R), Jeff Allen on The Next Generation of R Markdown and, of course, Hadley Wickham on dplyr: a grammar of data manipulation.

Dieter De Mesmaeker presented a poster on Rdocumentation.org, a really nice web interface to the documentation of R.
All in all, a great conference! I’m already looking forward to next year's UseR! conference, which will be held at Aalborg University, not too far from where I live (at least compared to LA).
To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff.
Sometimes Table is not the Answer – a Faster 2×2 Table
(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to Rbloggers)
The table command is great in its simplicity for cross tabulations. I have run into some settings where it is slow, and I want to demonstrate one simple example here of why you may want to use other functions or write your own tabler. This is a specific case where you don't need all the error-checking or flexibility that a general function provides; you want to do one specific thing, and you can greatly speed up the computation.
Setup of example
I have some brain imaging data. I have a gold standard, where an expert hand-traced (on a computer) a brain scan delineating the brain. I'll refer to this as a brain “mask”. (We use the word mask in imaging to denote a segmented image, either done manually or automatically. I generally reserve the word mask for binary 0/1 values, but others use the term more broadly.)
Using automated methods, I can try to recreate this mask automatically. This image is also binary. I want to simply get a \(2\times2\) contingency table of the automated versus manual masks so I can get sensitivity/specificity/accuracy/etc.
The data
For simplicity and computation, let's consider the images as just really long vectors. I'll call them manual and auto for the manual and automatic masks, respectively.
These are long logical vectors (9 million elements):
length(manual)
[1] 9175040
length(auto)
[1] 9175040
head(manual)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(auto)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Naturally, you can run table on this data:
stime = system.time({
  tab = table(manual, auto)
})
print(stime)
   user  system elapsed
  3.294   0.082   3.386
print(tab)
        auto
manual     FALSE    TRUE
  FALSE  7941541   11953
  TRUE     15384 1206162

The computation took about 3.4 seconds on my MacBook Pro (2013, 16GB RAM, 2.8GHz Intel i7), which isn't that bad. Realize, though, that I could have hundreds or thousands of these images. We need to speed this up.
What is the essence of what we're doing?
Taking 3.4 seconds to get 4 numbers seems a bit long. As the data is binary, we can simply compute these with the sum command and logical operators.
Let's make the twoXtwo command:
twoXtwo = function(x, y, dnames = c("x", "y")) {
  tt <- sum(x & y)
  tf <- sum(x & !y)
  ft <- sum(!x & y)
  ff <- sum(!x & !y)
  tab <- matrix(c(ff, tf, ft, tt), ncol = 2)
  n <- list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(n) <- dnames
  dimnames(tab) <- n
  tab <- as.table(tab)
  tab
}

And let's see how fast this is (and confirm the result is the same):
stime2 = system.time({
  twotab = twoXtwo(manual, auto, dnames = c("manual", "auto"))
})
print(stime2)
   user  system elapsed
  0.828   0.006   0.835
print(twotab)
        auto
manual     FALSE    TRUE
  FALSE  7941541   11953
  TRUE     15384 1206162
identical(tab, twotab)
[1] TRUE

Voilà: twoXtwo runs about 4.06 times faster than table, largely because we knew we did not have to check certain characteristics of the data, and because it handles only this specific case of a table.
More speed, captain!
This isn't an astronomical speedup such as a 100-fold increase, but we can go faster still by not doing all the logical operations on the vectors, and instead taking differences from the margin sums.
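The definition of twoXtwo2 is linked rather than shown in this excerpt. Here is a sketch of what a margin-sum version might look like, reconstructed from the description above (the body is an assumption, not the author's exact code, though the name matches the benchmark that follows):

```r
# Sketch of a margin-sum 2x2 tabler (a reconstruction, not the original code).
# Only one element-wise operation (x & y) is needed; for logical vectors,
# sum(x) and sum(y) give the margin totals, and the remaining cells are
# differences from those margins.
twoXtwo2 <- function(x, y, dnames = c("x", "y")) {
  tt <- sum(x & y)                 # TRUE/TRUE cell
  tf <- sum(x) - tt                # TRUE/FALSE: x margin minus tt
  ft <- sum(y) - tt                # FALSE/TRUE: y margin minus tt
  ff <- length(x) - tt - tf - ft   # FALSE/FALSE: the remainder
  tab <- matrix(c(ff, tf, ft, tt), ncol = 2)
  n <- list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(n) <- dnames
  dimnames(tab) <- n
  as.table(tab)
}
```

The savings come from replacing three vector scans (x & !y, !x & y, !x & !y) with scalar subtractions.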
Let's confirm this is faster and accurate by running it on our data:
stime3 = system.time({
  twotab2 = twoXtwo2(manual, auto, dnames = c("manual", "auto"))
})
print(stime3)
   user  system elapsed
  0.198   0.001   0.200
print(twotab2)
        auto
manual     FALSE    TRUE
  FALSE  7941541   11953
  TRUE     15384 1206162
identical(tab, twotab2)
[1] TRUE

Now, if I were going for speed, this code is good enough for me: it runs about 16.93 times faster than table. The one downside is that it is not as readable as twoXtwo. For even greater speed, I could probably move into C++ using the Rcpp package, but that seems overkill for a two-by-two table.
Other examples of speeding up the calculation can be found here.
Finishing up
I said I wanted sensitivity/specificity/accuracy/etc., so I will show how to get these. I'm going to use prop.table, which I didn't know about for a while when I first started using R (see margin.table too).
ptab = prop.table(twotab)
rowtab = prop.table(twotab, margin = 1)
coltab = prop.table(twotab, margin = 2)

As you can see, like the apply command, the prop.table command can either take no margin or take the dimension to divide over (1 for rows, 2 for columns). This means that in ptab, each cell of twotab was divided by the grand total (or sum(tab)). For rowtab, each cell was divided by rowSums(tab) to get a proportion, and similarly cells in coltab were divided by colSums(tab). At the end of the post, I show that these are the same.
Getting Performance Measures

Accuracy
Getting the accuracy is very easy:
accur = sum(diag(ptab))
print(accur)
[1] 0.997

Sensitivity/Specificity
For sensitivity/specificity, the “truth” is in the rows of the table, so we want the row percentages:
sens = rowtab["TRUE", "TRUE"]
spec = rowtab["FALSE", "FALSE"]
print(sens)
[1] 0.9874
print(spec)
[1] 0.9985

Positive/Negative Predictive Value
We can also get the positive predictive value (PPV) and negative predictive value (NPV) from the column percentages:
ppv = coltab["TRUE", "TRUE"]
npv = coltab["FALSE", "FALSE"]
print(ppv)
[1] 0.9902
print(npv)
[1] 0.9981

Conclusions
After using R for years, I still find it very intuitive. Sometimes, though, for large data sets or specific cases, you may want to write your own function for speed, checking it against the base functions for a few iterations as a double-check. I have heard this can be a nuisance for those who dislike R, and it can hamper reproducibility in some cases. Usually someone has made a vetted package that does what you want, but there are still simple cases (such as the one above) where “rolling your own” is OK and easier than adding a dependency to your code.
Aside: How prop.table works

Over all margins
For prop.table without a margin, you can think of the table being divided by its sum.
print(round(ptab, 3))
        auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131
print(round(twotab / sum(twotab), 3))
        auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

Over row margins
For margin = 1, or row percentages, you can think of dividing the table by the row sums.
print(round(rowtab, 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987
print(round(twotab / rowSums(twotab), 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

Over column margins
Now for column percentages, you can think of R dividing each cell by its column's sum. This is what prop.table does.
Let's look at the result we should get:
print(round(coltab, 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

But in R, when you take a matrix and then add/divide/subtract/multiply it by a vector, R does the operation column-wise. So when you take column sums of the table, you get a vector with the same number of columns as the table:
print(colSums(twotab))
  FALSE    TRUE
7956925 1218115

If you try to divide the table by this value, it will not give you the desired result:
print(round(twotab / colSums(twotab), 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.990

R operations with matrices and vectors
This is because R treats a vector as a column vector (a matrix with 1 column). R takes the first column of the table and divides its first element by the first column sum (which is correct), but then divides the second element of that column by the second column sum (which is not correct).
A detailed discussion on SO of how to do row-wise operations on matrices is located here.
We can use the t( t() ) operation to get the correct answer:
print(round(t(t(twotab) / colSums(twotab)), 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

You can think of R implicitly making a matrix of the correct size with the correct values and then dividing:
cs = colSums(twotab)
cs = matrix(cs, nrow = nrow(tab), ncol = ncol(tab), byrow = TRUE)
print(cs)
        [,1]    [,2]
[1,] 7956925 1218115
[2,] 7956925 1218115
print(round(twotab / cs, 3))
        auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

Happy tabling!
To leave a comment for the author, please follow the link and comment on his blog: A HopStat and Jump Away » R-bloggers.
What are the names of the school principals in Mexico? If your name is Maria, this post will probably interest you. Trends and cool plots from the 2013 national education census of Mexico
(This article was first published on Computational Biology Blog in fasta format, and kindly contributed to Rbloggers)
I will start this post with a disclaimer. The main intention of the post is to show the distribution of school principals' first names in Mexico: for example, the most common first name nationwide, as well as trends by state and region.
These trends in the data should answer questions such as:
1. Are the most common first names distributed equally among the states?
2. Do states in the same region also share the same "naming" behavior?
Additionally, this post includes cool wordclouds.
Finally, as the last part of my disclaimer: I am really concerned about the privacy of the persons involved. I am not in any sense promoting the exploitation of this personal data. If you decide to download the dataset, I ask you to study it and to generate information that is beneficial; do not join the Dark Side.
Benjamin
##################
# GETTING THE DATASET AND CODE
##################
The database is located here
The R code can be downloaded here
Additional data can be downloaded here
All the results were computed by exploring 202,118 schools across the 32 states of Mexico from the 2013 census.
##################
# EXPLORING THE DATA
# WITH WORDCLOUDS
##################
Here is the wordcloud of names (by name, I am referring to first name only). It shows that MARIA is by far the most common first name of a school principal in Mexican schools, followed by JOSE and then JUAN.
The following wordcloud includes every word in the responsible_name column (first name and last names). Now the plot shows that, besides the common first name MARIA, the last names HERNANDEZ, MARTINEZ and GARCIA are also very common.
##################
# EXPLORING THE FREQUENCY
# OF FIRST NAMES (TOP 30  NATIONWIDE)
##################
Looking at this barplot, MARIA is by far the most common name among Mexican school principals, with a frequency of ~25,000. The next most popular name is JOSE, with a frequency of ~7,500.
Looking at the same data, adjusted to represent the percentage of each name within the pool of first names, MARIA accounts for ~11% of the name pool.
##################
# HEATMAPS OF THE DATA
##################
With this heatmap, my intention is to show the distribution of the top 20 most common first names across all the Mexican states
It might seem that there is a small cluster of states with the largest numbers of principals named MARIA, but not so fast: some states, for example Mexico and Distrito Federal, are very populous, so I will reduce this effect in the following plot. In summary, this plot shows the distribution of the frequencies of the top 20 first names across the country.
##################
# CLUSTERS OF THE DATA
##################
For me, a young data-science padawan, this is my favorite analysis: "hunting down the trends".
The setup of the experiment is very simple: map the top 1,000 most frequent nationwide names across each state to create a 32 x 1000 matrix (32 states by the 1,000 most frequent names nationwide).
With this matrix, normalize the values by dividing each row by its sum (this minimizes the effect of populous vs. less populous states while maintaining the proportions of the name frequencies within each state). Then I computed a distance matrix and plotted it as a heatmap.
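As a sketch of those two steps (row normalization, then a distance matrix rendered as a heatmap), here is a toy version with a small random matrix standing in for the real 32 x 1000 state-by-name matrix, which is not reproduced here:

```r
# Toy stand-in for the 32 x 1000 state-by-name count matrix
set.seed(1)
m <- matrix(rpois(5 * 4, lambda = 10), nrow = 5,
            dimnames = list(paste0("state", 1:5), paste0("name", 1:4)))

# Normalize: divide each row by its sum, so every row sums to 1
m_norm <- m / rowSums(m)

# Pairwise distances between states, plotted as a heatmap
d <- dist(m_norm)
heatmap(as.matrix(d), symm = TRUE)
```

Note that dividing a matrix by rowSums() works here because R recycles the length-nrow vector down each column, which lines up with the rows.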
What I can conclude from this plot is that some clusters of states maintain a geographical preference: states sharing the same region seem more likely to share "naming" trends, presumably due to cultural factors (like the cluster that includes Chihuahua, Sonora and Sinaloa). But this effect is not present in all the clusters.
All images can be downloaded in PDF format here, just don't do evil with them!
Plot 1 here
Plot 2 here
Plot 3 here
Plot 4 here
Plot 5 here
Plot 6 here
Benjamin
To leave a comment for the author, please follow the link and comment on his blog: Computational Biology Blog in fasta format.
DSC 2014. Day 1
(This article was first published on JAGS News » R, and kindly contributed to Rbloggers)
This is a report on the first day of the Directions in Statistical Computing (DSC) conference that took place in Brixen, Italy (see here for an introduction). Performance enhancements were the main theme of the day, covering not just improvements to R itself but also alternative language implementations.
Luke Tierney started by presenting the new implementation of reference counting, an experimental feature of the current R development branch. This should reduce unnecessary copies of objects when they are modified, which is an important cause of poor performance. Reference counting replaces the simpler “named” mechanism for determining whether an object needs to be copied when it is modified. It uses the same 2-bit field used by “named”, so counts are limited to the set {0,1,2,3+}, where 3+ is “sticky” and cannot be decremented. However, this is sufficient to stop user-generated replacement functions (i.e. function calls that appear on the left-hand side of an assignment) from generating spurious copies.
Radford Neal (University of Toronto) reviewed some of the optimizations to the R interpreter that he has included in Pretty quick R (pqR), his fork of the R 2.15.0 code base. Some of these optimizations are described in more detail on his blog. In particular, deferred evaluation and task merging are also used in the alternative R implementations Riposte and Renjin (see below). Notably, pqR already includes full reference counting.
Thomas Kalibera talked about changes to the R byte code compiler that can improve performance without changing R semantics. He has been working with Luke Tierney to incorporate these changes into R.
Of course, optimization requires performance analysis. Helena Kotthaus (Technical University of Dortmund) presented a suite of Rbased facilities for performance analysis, including an instrumented version of R and a set of benchmarks. They can all be found in the allr github repository.
In addition to these efforts to improve R, there are several alternate implementations written from scratch. Most of these were well represented at the meeting.
 Michael Haupt presented FastR, a reimplementation of R in Java, built on top of the Truffle interpreter and the Graal byte compiler. FastR is a collaboration between Purdue University, Johannes Kepler University Linz, and Oracle Labs.
 Alexander Bertram (BeDataDriven) presented Renjin, another open source R interpreter that runs on the JVM.
 Justin Talbot (Tableau Software) presented Riposte, a fast interpreter and JustInTime (JIT) compiler for R written in C++.
Two further R implementations that were not specifically presented at the meeting should also be mentioned here:
 CXXR is a refactorization of the R code base by Andrew Runnalls (University of Kent). It replaces parts of the R interpreter written in C with C++ code. Andrew Runnalls was not present at the meeting.
 TIBCO Enterprise Runtime for R (TERR) is a clean room reimplementation of R. This is the only closed source R implementation. Bill Dunlap from TIBCO was an active participant in the meeting but did not give a presentation. However, a presentation on TERR vs R performance concentrating on memory management issues was given by Michael Sannella (TIBCO) at last year’s UseR! conference, and is available here.
Jan Vitek (Purdue University) gave a nice overview of these R implementations and the problems they face. One of the key issues is the lack of a formal specification of the R language. In other words, there is no document that formally sets out what is allowed, what is not allowed and what is undefined behaviour. The only way to test whether you have a correct reimplementation is to try to run code that also runs on R. The prize here is to be able to run the 5709 R packages on CRAN without modification, but many of the reimplementations fall short of this goal.
As Robert Gentleman noted in summing up, we now have a “language community” of computer scientists interested in R in addition to the R user and developer communities.
To leave a comment for the author, please follow the link and comment on his blog: JAGS News » R.
3 ways that functions can improve your R code
(This article was first published on R for Public Health, and kindly contributed to Rbloggers)
To leave a comment for the author, please follow the link and comment on his blog: R for Public Health.
How to pick up 3 numbers from a uniform distribution in a transparent manner?
(This article was first published on Christophe Ladroue » R, and kindly contributed to Rbloggers)
Over in my previous post, I’m giving away 3 copies of my video course on ggplot2 and shiny. To win a copy, you just need to leave a comment and I will select 3 winners among the n participants at random after a deadline.
But how do I pick 3 winners such that:
 all players are equally likely to win.
 no one can contest the fairness of the selection.
The first thing that comes to mind is to run sample(n,3, replace = FALSE) and report the winners. But how do you know I actually ran this code? I could have decided on the winners well in advance and just pretended to run the code.
A way to approach this issue could be to set the random seed to some value so that anyone suspecting foul play can run the code themselves and get the same answer as me:
set.seed(someSeed)
sample(n, 3, replace = FALSE)

I see at least two problems with this: 1) I could still have selected a seed that gives me the sample I eventually want, and 2) even using a function (e.g. of n, the number of participants) as the seed doesn't guarantee a uniform distribution for each player.
I came up with a plan which I think addresses both the uniform distribution over the players, and the incontestability of the selection.
First, I simplify the problem of selecting 3 winners among n participants to selecting 1 integer from a uniform distribution. This is easy: instead of choosing 3 items among n, I’m selecting 1 of the choose(n,3) possible combinations. Once I’ve sampled 1 number i, I simply use combn(n,3) to generate all the combinations and pick the ith item:
combn(n,3, simplify=FALSE)[[i]].
Second, I have a way to pick a number from a uniform distribution that’s completely transparent. To do this, I pick up a number uniformly at random in a much bigger set (1:N, with N>>choose(n,3)) and project this number back to the interval I’m interested in (1:choose(n,3)). That is, once I have a number j between 1 and N, I use
i <- ceiling(choose(n, 3) * j / N)

to find a random value uniformly distributed from 1 to choose(n,3).
Ideally you’d want N to be a multiple of choose(n,3) for every outcome to be exactly equally likely, but if N is much larger than choose(n,3), the slight difference in probability for each outcome is negligible (of the order of 1/N).
Now, how do I pick a number in a bigger set in a transparent manner? We saw that using the seed is fraught with difficulty, so I need something else. I'm going to use something which neither I nor the players have any control over: the UK national lottery, which draws a combination of 7 integers from the set {1,…,49}. More precisely, I'm doing this:
 declare in advance which future lottery draw I’m using.
 use the lexicographic index of the combination drawn as my number j; j comes from a uniform distribution between 1 and N=choose(49,7), which is a relatively big number.
And this is it: this way, I pick up 3 winners among n and there is no way for me (or the players) to rig the selection. Here is the code I’ll run:
lotteryResult <- c( , , , , , , ) # to be filled in with the actual lottery draw
nPlayers <- 200 # to be updated with the number of participants
nWinners <- 3
index <- ceiling(choose(nPlayers, nWinners) * lexicographicIndex(lotteryResult, 49) / choose(49, 7))
winners <- combn(nPlayers, nWinners, simplify = FALSE)[[index]]
cat("\n The winners are:", winners)

The deadline for the competition is Wednesday 9th July at midnight UK time. The next lottery (UK Lotto) draw after that is on Saturday 12th July, and I'll use that draw to decide on the winners, using the code I've presented here.
What do you think? Can you poke holes in my strategy? Can you come up with something simpler?
Note about the lexicographic index
It is not terribly difficult to find the index of a combination without generating them all. All you need to do is count the number of combinations that come before it. For example, if the combination starts with 3, you know it comes after all the combinations that start with 1 or 2. Here is the code I wrote to go from a combination to its lexicographic index. There's also a test function after it.
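The linked code is not included in this excerpt, so here is a sketch of one way to write such a function, counting the combinations that precede a given one in combn()'s lexicographic order (the name lexicographicIndex matches the earlier snippet, but this body is a reconstruction, not the author's code):

```r
# Reconstruction: 1-based lexicographic index of a k-combination of 1..maxN,
# matching combn()'s ordering. Not necessarily the author's exact code.
lexicographicIndex <- function(comb, maxN) {
  comb <- sort(comb)
  k <- length(comb)
  idx <- 1
  prev <- 0
  for (i in seq_len(k)) {
    # count all combinations whose i-th element is smaller than comb[i],
    # given that the first i-1 elements agree with comb
    js <- seq_len(comb[i] - 1)
    js <- js[js > prev]
    idx <- idx + sum(choose(maxN - js, k - i))
    prev <- comb[i]
  }
  idx
}
```

For example, lexicographicIndex(c(2, 4), 4) gives the position of c(2, 4) within combn(4, 2), i.e. 5.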
To leave a comment for the author, please follow the link and comment on his blog: Christophe Ladroue » R.
Introduction to R for Life Scientists: Course Materials
(This article was first published on Getting Genetics Done, and kindly contributed to Rbloggers)
Last week I taught a three-hour introduction to R workshop for life scientists at UVA's Health Sciences Library. I broke the workshop into three sections:
In the first half hour or so I presented slides giving an overview of R and why R is so awesome. During this session I emphasized reproducible research and gave a demonstration of using knitr + rmarkdown in RStudio to produce a PDF that can easily be recompiled when data updates.
In the second (longest) section, participants had their laptops out with RStudio open, coding along with me as I gave an introduction to R data types, functions, getting help, data frames, subsetting, and plotting. Participants were challenged with an exercise requiring them to create a scatter plot using a subset of the built-in mtcars dataset.
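A minimal answer to that exercise might look like the following (the particular subset and variables are illustrative choices, not necessarily the ones used in the workshop):

```r
# One possible answer: weight vs. fuel economy for the 4-cylinder cars
cars4 <- subset(mtcars, cyl == 4)
plot(cars4$wt, cars4$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "4-cylinder cars in mtcars")
```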
We concluded with an analysis of RNA-seq data using the DESeq2 package. We started with a count matrix and a metadata file (the modENCODE pasilla knockout data packaged with DESeq2), imported the data into a DESeqDataSet object, ran the DESeq pipeline, extracted results, and did some basic visualization (MA plots, PCA, volcano plots, etc.). A future day-long course will cover RNA-seq in more detail (intro UNIX, alignment, and quantitation in the morning; intro R, QC, and differential expression analysis in the afternoon).
I wrote the course materials using knitr, rendered using Jekyll, hosted as a GitHub project page. The rendered course materials can be found at the link below, and the source is on GitHub.
Course Materials: Introduction to R for Life Scientists
Slides:
Cheat Sheet:
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons AttributionShareAlike 3.0 Unported License.
To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.
Four Simple Turtle Graphs To Play With Kids
(This article was first published on Ripples, and kindly contributed to Rbloggers)
Technology is just a tool: in terms of getting the kids working together and motivating them, the teacher is the most important (Bill Gates)
Some days ago I read this article on R-bloggers and discovered the TurtleGraphics package. I learned about turtle graphics a long time ago and had been thinking of writing a post about them, doing some drawings of my own using this technique. In fact, some of my previous drawings could have been done using this package.
Turtle graphics are a good tool to teach kids some very important mathematical and computational concepts such as length, coordinates, location, angles, and iteration. Small changes in some of these parameters (especially the angles) can produce very different drawings. Those who want to know more about turtle graphics can read the Wikipedia article.
Following you can find four simple graphics preceded by the code to generate them. Please, send me your creations if you want:
library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
for (i in 1:150) {
  turtle_forward(dist = 1 + 0.5 * i)
  turtle_right(angle = 89.5)
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_right(angle = 234)
for (i in 1:100) {
  turtle_forward(dist = 0.9 * i)
  turtle_right(angle = 144.3)
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(48, 36)
d = 50
for (i in 1:300) {
  turtle_forward(dist = d)
  if (i %% 4 == 0) {
    turtle_right(angle = 75)
    d = d * .95
  } else turtle_right(angle = 90)
}
turtle_hide()

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(50, 35)
turtle_right(angle = 30)
d = 25
turtle_setpos(50 - d/2, 50 - d/2 * tan(pi/6))
for (i in 1:100) {
  turtle_forward(dist = d)
  d = d + .5
  turtle_right(angle = 120 + 1)
}
turtle_hide()

To leave a comment for the author, please follow the link and comment on his blog: Ripples.