R bloggers

R news and tutorials contributed by (600) R bloggers

asymptotically exact inference in likelihood-free models [a reply from the authors]

Wed, 2016-11-30 18:16

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

[Following my post of last Tuesday, Matt Graham commented on the paper in great detail. Here are those comments. A nicer HTML version of the Markdown reply below is also available on GitHub.]

Thanks for the comments on the paper!

A few additional replies to augment what Amos wrote:

This however sounds somewhat intense in that it involves a quasi-Newton resolution at each step.

The method is definitely computationally expensive. If the constraint function maps an M-dimensional input space to an N-dimensional output space, with M≥N, then for large N the dominant costs at each timestep are usually the evaluation of the constraint Jacobian (∂c/∂u) (with reverse-mode automatic differentiation this can be evaluated at a cost of O(N) generator / constraint evaluations) and the Cholesky decomposition of the Jacobian product (∂c/∂u)(∂c/∂u)‘ with O(N³) cost (though in many cases, e.g. i.i.d. or Markovian simulated data, structure in the generator Jacobian can be exploited to give a significantly reduced cost). Each inner quasi-Newton update involves a pair of triangular solve operations with O(N²) cost, two matrix-vector multiplications with O(MN) cost, and a single constraint / generator function evaluation; the number of quasi-Newton updates required for convergence in the numerical experiments tended to be much less than N, hence the quasi-Newton iteration tended not to be the main cost.
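To make that cost breakdown concrete, here is a minimal R sketch of the kind of projection iteration being described; the function names, the fixed-Jacobian simplification and the convergence test are assumptions for illustration only, not the paper's exact scheme.

# Project a point u back onto the constraint manifold c(u) = 0 using a quasi-Newton
# iteration with a *fixed* Jacobian J = dc/du evaluated at the previous point.
project_onto_manifold <- function(u, constr, J, tol = 1e-8, max_iter = 100) {
  L <- t(chol(J %*% t(J)))                      # O(N^3) Cholesky of the N x N Gram matrix, done once
  lambda <- rep(0, nrow(J))
  for (i in seq_len(max_iter)) {
    c_val <- constr(drop(u + t(J) %*% lambda))  # one constraint / generator evaluation plus an O(MN) matvec
    if (max(abs(c_val)) < tol) break
    # two O(N^2) triangular solves give (J J')^{-1} c_val
    lambda <- lambda - backsolve(t(L), forwardsolve(L, c_val))
  }
  drop(u + t(J) %*% lambda)
}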

The high computation cost per update is traded off however with often being able to make much larger proposed moves in high-dimensional state spaces with a high chance of acceptance compared to ABC MCMC approaches. Even in the relatively small Lotka-Volterra example we provide which has an input dimension of 104 (four inputs which map to ‘parameters’, and 100 inputs which map to ‘noise’ variables), the ABC MCMC chains using the coarse ABC kernel radius ϵ=100 with comparably very cheap updates were significantly less efficient in terms of effective sample size / computation time than the proposed constrained HMC approach. This was in large part due to the elliptical slice sampling updates in the ABC MCMC chains generally collapsing down to very small moves even for this relatively coarse ϵ. Performance was even worse using non-adaptive ABC MCMC methods and for smaller ϵ, and for higher input dimensions (e.g. using a longer sequence with correspondingly more random inputs) the comparison becomes even more favourable for the constrained HMC approach.

A lot of the improvement seems to be coming from using gradient information: running standard HMC on the inputs u with a Gaussian ABC kernel (which is equivalent to our baseline in the pose and MNIST experiments, or the method proposed in the recent Pseudo-marginal Hamiltonian Monte Carlo by Lindsten and Doucet) seems to give comparable performance when normalised for computation time for moderate ϵ. Although the standard HMC updates are much cheaper to compute, the large difference in scale between the change in density in directions normal to the (soft) constraint and in the tangent space of the constraint manifold means a small step size needs to be used for the standard HMC updates to maintain reasonable accept rates (in general the manifold is non-linear, so we cannot adjust for the scale differences by a simple linear transformation / non-identity mass matrix). As a result, despite the much cheaper updates, standard HMC tends to give similar effective samples per computation time, and as ϵ becomes smaller this approach becomes increasingly less efficient compared to the constrained HMC method.

I also find it surprising that this projection step does not jeopardise the stationary distribution of the process, as the argument found therein about the approximation of the approximation is not particularly deep.

The overall simulated constrained dynamic, including the projection step, is symplectic on the constraint manifold (as shown in Symplectic numerical integrators in constrained Hamiltonian systems by Leimkuhler and Reich) and time reversible, provided the projection iteration is run to convergence and an ‘appropriate’ step size is chosen (i.e. sufficiently small compared to the curvature of the manifold that a solution to the projection on to the manifold is guaranteed to exist and that there is only one nearby solution to which the projection step will converge). The issues with requiring an appropriate step size to ensure a solution to the non-linear equations being solved exists and is locally unique are the same as in the implicit integrators used in Riemannian-manifold HMC methods.

In practice we found that the geodesic integration scheme, which splits up the unforced motion on the constraint manifold into multiple smaller steps for each outer forced step, helped in allowing a step size appropriate for the curvature of the manifold to be chosen independently of that appropriate for the change in the density. Provided an appropriately small step size was used, non-convergence was very rarely an issue, and generally only in the initial updates where the geometry of the constraint manifold might be expected to be non-typical of that seen after warm-up.

But the main thing that remains unclear to me after reading the paper is how the constraint that the pseudo-data be equal to the observable data can be turned into a closed form condition like g⁰(u)=0.

For concreteness I’ll assume a parameter inference problem with parameters θ generated from some prior distribution given a vector of standard variates u¹ i.e. θ=ρ(u¹) where ρ is a deterministic function.

Further let f(θ,u²) be our simulator function which given access to a further vector of variates from a random number generator u² and parameters θ=ρ(u¹) can produce simulated data x=f(ρ(u¹),u²).

If we have observed data x˘ then the constraint that x=x˘ can be written f(ρ(u¹),u²)−x˘=0, so we would have c(u)=f(ρ(u¹),u²)−x˘=0 where u=[u¹;u²].
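As a toy illustration of this construction in R (the prior map ρ, the simulator f and the observed data below are made-up placeholders, not the paper's models):

rho <- function(u1) exp(u1)                          # map standard variates u1 to parameters
f   <- function(theta, u2) theta[1] + theta[2] * u2  # simulator driven by 'noise' inputs u2
x_obs <- c(1.3, 0.7, 2.1)                            # observed data x˘
constraint <- function(u) {                          # c(u) = f(rho(u1), u2) - x˘
  u1 <- u[1:2]
  u2 <- u[3:5]
  f(rho(u1), u2) - x_obs
}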

As mentioned above, the authors assume a generative model based on uniform (or other simple) random inputs but this representation seems impossible to achieve in reasonably complex settings.

The representation requirements can be split into two components:

  1. We can express our simulator model in the form y=g(u)
  2. g is differentiable with respect to u

The assumed differentiability of the generator is definitely a strong restriction, and does limit the models which this can be applied to.

I’d argue that it’s quite rare, in the context of simulator-type models, that the first assumption isn’t valid (though this may be a somewhat circular argument, as it could just be that what I’m defining as a simulator model is something to which it applies!).

All (1) requires is that in the simulator code all ‘randomness’ is introduced by drawing random variates from a (pseudo-)random number generator where the density of the drawn variates is known (up to some potentially unknown normalisation constant). More explicitly, for the parameter inference setting above, if we can write our generator function in the form

 

def generator(rng):
    params = generate_from_prior(rng)
    simulated_data = simulator(rng, params)
    return [params, simulated_data]

where rng is a pseudo-random number generator object allowing independent samples to be generated from standard densities then this assumption holds.

For (1) alone to hold, the code in this function can be completely arbitrary. This could be code numerically integrating a set of stochastic differential equations, a graphics rendering pipeline or a learned parametric ‘neural network’ type model (with the three experiments in our paper providing toy examples of each of these).

There are degrees of freedom in what to consider the random inputs. I believe pseudo-random number generator implementations generally generate continuous random variates from just a pseudo-random uniform primitive (which itself will be generated from a pseudo-random integer primitive), e.g. RandomKit which is used for the random number implementation in (amongst other libraries) NumPy provides a rk_double function to generate a double-precision pseudo-random uniform variate in [0, 1) (and a rk_gauss function to generate a pair of Gaussian random variates but this uses rk_double internally). So assuming u is an arbitrarily long vector of random uniform variates will often be sufficient for allowing the simulator to be expressed as in (1) as mentioned by Dennis Prangle in the comments of your recent post ‘rare events for ABC’.

In general our proposed method is actually more suited to using random inputs with unbounded support, as otherwise it is necessary to deal with reflections at the intersection of the constraint manifold and the bounds of the support (which is possible while maintaining reversibility, but a bit painful implementation-wise). So, for example, it is better to use Gaussian variates directly than to specify uniform inputs and then transform them to Gaussians. A uniform variate might instead be generated by, for example, setting the base density to the logistic distribution and then transforming through the logistic sigmoid. This problem of automatically transforming to unconstrained inputs has been well studied in, for example, Stan.
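For instance, the logistic trick just mentioned is a one-liner in R (purely illustrative):

u_unbounded <- rlogis(1)            # unbounded input with a logistic base density
u_uniform   <- plogis(u_unbounded)  # the logistic sigmoid maps it differentiably to (0, 1)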

Returning to the limitations imposed by assuming (2), i.e. differentiability: some of the transforms from uniform variates used to generate random variates in random number generator implementations are non-differentiable, e.g. if using an accept/reject step, as in common methods for generating a Gamma variate. There are a couple of options here. We can often just use the density of the output of the transform itself, e.g. a Gamma base density on the relevant uᵢ; if the parameters of the variate distribution are themselves dependent on other random inputs we need to make sure to include this dependency, but it’s possible to track these dependencies automatically – again, probabilistic programming frameworks like Stan do just this. In other cases we might be able to use alternative (potentially less computationally efficient) transformations that avoid the accept/reject step, e.g. using the original Box-Muller transform rather than the more efficient but rejection-based polar variant, or use tricks such as in the recent Rejection Sampling Variational Inference by Naesseth et al.
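As an example of such a rejection-free alternative, the original Box-Muller transform maps two uniforms to two Gaussians through smooth functions only (a sketch, not code from the paper):

box_muller <- function(u1, u2) {
  r <- sqrt(-2 * log(u1))
  c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))  # two independent standard Gaussian variates
}
box_muller(runif(1), runif(1))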

Filed under: R, Statistics Tagged: ABC, ABC-MCMC, arXiv, Edinburgh, generator, Hamiltonian, HMC, Jacobian, likelihood-free methods, Lotka-Volterra, manifold, normalisation, pseudo-marginal MCMC, quasi-Newton resolution, reply, Riemann manifold, simulation under restrictions, simulator model


Frequency and chi-square test for independence Exercises

Wed, 2016-11-30 12:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we cover some basics of frequency tables. We also briefly look at the chi-square test for independence to find relationships between two variables. Before proceeding, it might be helpful to look over the help pages for the table, summary, and margin.table functions.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Use the attach() command to load the trees dataset in R.

Exercise 2
Use the table() command with the arguments: trees$Height and trees$Volume. This will generate a two-way frequency table. Store this in variable mytable.

Exercise 3
If you are familiar with Excel pivot tables, then you will know this function. Use the margin.table() function to get the Height frequencies summed over Volume.

Exercise 4
Use the margin.table() function to get the Volume frequencies summed over Height.

Exercise 5
Now use the table() function again, this time using all three features of the trees dataset: Girth, Height and Volume. This will print out a multidimensional three-way frequency table.
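For reference, a minimal sketch of the kind of calls Exercises 1–5 have in mind (variable names follow the exercise text; the full solutions are on the answers page):

attach(trees)
mytable <- table(trees$Height, trees$Volume)    # Exercise 2: two-way frequency table
margin.table(mytable, 1)                        # Exercise 3: Height frequencies summed over Volume
margin.table(mytable, 2)                        # Exercise 4: Volume frequencies summed over Height
table(trees$Girth, trees$Height, trees$Volume)  # Exercise 5: three-way frequency table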

Exercise 6
Suppose you have a variable ‘a’ that stores a second sample of heights of trees.

a=c(70, 65, 63, 72, 80, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74, 85, 8, 71, 63, 78, 80, 74, 72, 77, 81, 82, 80, 86, 80, 87)

Use cbind() to add the a column to your trees dataset. Store the result back into trees.

Exercise 7
Now create a two-way frequency table with Height and a as the arguments. Store this table in mytable_2.

Exercise 8

Use the margin.table() function again, as in Exercise 3, to get the Height frequencies summed over a. What differences do you observe from the results of Exercise 3?

Exercise 9
Chi-square test for independence:
a) Print the results of the summary() function on mytable. Note the chi-square test for independence results and the p-value.
b) Print the results of the summary() function on mytable_2. Note the chi-square test for independence results and the p-value.
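A sketch of what Exercise 9 is asking for: calling summary() on a table object reports the chi-square test of independence along with its p-value.

summary(mytable)    # chi-square test of independence for Height vs. Volume
summary(mytable_2)  # chi-square test of independence for Height vs. a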

Exercise 10
What did the chi-square test for independence help you to see?


Missing Values, Data Science and R

Wed, 2016-11-30 12:00

(This article was first published on RStudio, and kindly contributed to R-bloggers)

by Joseph Rickert

One of the great advantages of working in R is the quantity and sophistication of the statistical functions and techniques available. For example, R’s quantile() function allows you to select one of nine different methods for computing quantiles. Who would have thought there could be so many ways to do something that seems to be so simple? The issue here is not unnecessary complication, but rather an appreciation of the nuances associated with inference problems gained over the last hundred years of modern statistical practice. Much of this knowledge is reflected in R. As I imagine it, if R’s functions were laid out like city streets there would be many broad avenues displaying commonly used algorithms, but these would flow into many thousands of small, rarely-visited alleys, containing treasures for those who seek them out.

My impression is that most data science is done on the avenues, but that data scientists could benefit from occasionally venturing deeper into the alleys of statistical practice. Statistics might be the least important part of data science but there are times when coming to grips with sound statistical inference is essential. Consider the problem of dealing with missing values for example. It is difficult to imagine any large, real-world data set that wouldn’t require a strategy for imputing missing values. The obvious first step in developing a strategy would be to form some ideas about why the data are missing. Gelman and Hill identify four different “missingness mechanisms”: (1) Missingness completely at random, (2) Missingness at random, (3) Missingness that depends on unobserved predictors, (4) Missingness that depends on the missing value itself; and they provide some advice on how to cope with each of them. In a large data set, there is no reason to believe that there is just one mechanism in play! Evaluating your data with respect to these categories and their combinations could require a frightening amount of exploratory work. So tools for looking at patterns in missing values are likely to be very helpful, even if using them requires sampling your data set.

The next step of deciding how to proceed in a statistically sound manner will likely pose considerable technical challenges, and opting for simple solutions may not be advisable, or even possible. Even with big data, ignoring observations with missing values could result in a catastrophic loss of information, and simple approaches to imputation, such as replacing the missing values of each variable with the variable mean or a common value, will produce a “completed” data set reflecting an artificial amount of certainty that will likely underestimate the variability of the data and bias results. R can’t eliminate the hard work, but it does provide a formidable array of missing value imputation tools that can get you in and out of the statistical alleys and help you to decide what techniques might be useful.

Before evaluating the various methods, it is helpful to make a couple of distinctions about imputation methods for multivariate data. The first is between single and multiple imputation. In single imputation, a particular algorithm or technique is used to produce a single estimate for each missing value. Multiple imputation methods, on the other hand, develop distributions for missing values and estimate individual missing values by drawing from this distribution. In general, these algorithms proceed in three steps. In the first imputation step, multiple draws are made from the distribution of missing values for each variable. This process results in several “completed” data sets where each missing value has been estimated by a plausible value. In the second step, the intended analysis is performed on each completed data set. In the last step, a “pooling” algorithm estimates the final values for the statistics of interest.

The second distinction is between Joint Modeling (JM) and Fully Conditional Specification (FCS) imputing. In JM, the data are assumed to follow a multivariate parametric distribution of some sort. This is theoretically sound but may not be flexible enough to adequately model the data. In FCS, the multivariate data model is specified by developing a separate conditional model for each variable with missing values.

Here is a short list of R packages for missing value imputation. I have selected these to give some idea of the variety of tools available.

Amelia implements the Amelia II algorithm which assumes that the complete data set (missing and observed data) are multivariate normal. Imputations are done via the EMB (expectation-maximization with bootstrapping) algorithm. The JSS paper describes a strategy for combining the models resulting from each imputed data set. The Amelia vignette contains examples.

BaBoon provides two variants of the Bayesian Bootstrap predictive mean matching to impute multiple missing values. Originally developed for survey data, the imputation algorithms are described as being robust with respect to imputation model misspecification. The best description and rationale for the algorithms seems to be the PhD thesis of one of the package authors.

Hmisc contains several functions that are helpful for missing value imputation, including aregImpute(), impute() and transcan(). Documentation on Hmisc can be found here.

mi takes a Bayesian approach to imputing missing values. The imputation algorithm runs multiple MCMC chains to iteratively draw imputed values from conditional distributions of observed and imputed data. In addition to the imputation algorithm, the package contains functions for visualizing the pattern of missing values in a data set and assessing the convergence of the MCMC chains. A vignette shows a worked example and the associated JSS paper delves deeper into the theory and the mechanics of using the method.

mice, which is an acronym for multivariate imputation by chained equations, formalizes the multiple imputation process outlined above and is probably the gold standard for FCS multiple imputation. Package features include:

  • Columnwise specification of the imputation model
  • Support for arbitrary patterns of missing data
  • Passive imputation techniques that maintain consistency among data transformations
  • Subset selection of predictors
  • Support of arbitrary complete-data methods
  • Support pooling various types of statistics
  • Diagnostics for imputations
  • Callable user-written imputation functions

The JSS paper describes how the desire to provide a separate imputation model for each variable led to the development of the chained equation technique, where a Gibbs sampler fills in the missing data. miceadds provides additional functions to be used with mice, including plausible value imputation, multilevel imputation functions, imputation using partial least squares (PLS) for high dimensional predictors, nested multiple imputation, and two-way imputation.
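To make the impute / analyse / pool workflow described earlier concrete, here is a minimal mice sketch; the bundled nhanes data set and the linear model are placeholders rather than a recommendation.

library(mice)
imp  <- mice(nhanes, m = 5, printFlag = FALSE)  # step 1: create 5 completed data sets
fits <- with(imp, lm(bmi ~ age + chl))          # step 2: run the analysis on each completed set
summary(pool(fits))                             # step 3: pool the estimates across imputations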

missingDataGUI implements a nice graphical interface for exploring missing data patterns with numeric and graphical summaries for numeric and categorical missing values and implements a number of imputation methods. The figure below shows the missing value map for the HouseVotes84 data set in the mlbench package.

missMDA performs principal component methods on incomplete data sets obtaining scores, loadings and graphical representations despite missing values. The package also includes functions to perform single and multiple imputation. The JSS paper provides the details.

VIM provides tools for visualizing missing or imputed values. Before imputation they can be used to study the pattern of missing values, afterwards these same tools can be used as diagnostics. VIMGUI puts a front end on the VIM functions and helps with handling the plot and imputation functions. The vignette is thorough and provides some nice examples of how one might look at missing value distributions.

vtreat provides tools to assist in the statistically sound preparation of data sets. It is not a package explicitly devoted to missing value imputation, but it can produce “cleaned” data sets that have no “Infinite/NA/NaN in the effective variable columns”. I include it here to emphasize that proper data preparation can simplify the missing value problem. The package has several vignettes.

yaImpute takes what might be thought of as a machine learning approach, using the k-nearest neighbor (kNN) algorithm to impute missing values. The JSS paper covers the theory and explains the package using a forestry application.

For additional R packages see Stef van Buuren’s webpage cataloging software for imputation, which lists twelve R packages that implement some method of single imputation and eighteen R packages concerned with multiple imputation, and provides a brief explanation of each of these. Also, have a look at the post on analyticsvidhya.com that provides informative short write-ups on Amelia, Hmisc, mi, mice and missForest, including some sample code.

I also recommend looking at Stef van Buuren’s presentation on the history and future of Fully Conditional Specification from the 2015 Rennes missing value conference and Julie Josse’s tutorial at useR! 2016.

Even with the best of tools there is no doubt that dealing with missing values in a statistically sound manner is difficult. However, this is the kind of work that helps to put the “science” in data science, and R is the most helpful environment you are likely to find.


Jupyter And R Markdown: Notebooks With R

Wed, 2016-11-30 11:52

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

When working on data science problems, you might want to set up an interactive environment to work and share your code for a project with others. You can easily set this up with a notebook. 

In other cases, you’ll just want to communicate about the workflow and the results that you have gathered for the analysis of your data science problem. For a transparent and reproducible report, a notebook can also come in handy.

That’s right; notebooks are perfect for situations where you want to combine plain text with rich text elements such as graphics, calculations, etc.

The topic of today’s blog post focuses on the two notebooks that are popular with R users, namely, the Jupyter Notebook and, even though it’s still quite new, the R Markdown Notebook. You’ll discover how to use these notebooks, how they compare to one another and what other alternatives exist. 

R And The Jupyter Notebook

Contrary to what you might think, Jupyter doesn’t limit you to working solely with Python: the notebook application is language agnostic, which means that you can also work with other languages. 

There are two general ways to get started on using R with Jupyter: by using a kernel or by setting up an R environment that has all the essential tools to get started on doing data science.

Running R in Jupyter With The R Kernel

As described above, the first way to run R is by using a kernel. If you want to have a complete list of all the available kernels in Jupyter, go here.

To work with R, you’ll need to load the IRKernel and activate it to get started on working with R in the notebook environment.

First, you’ll need to install some packages. Make sure that you don’t do this in your RStudio console, but in a regular R terminal, otherwise you’ll get an error like this:

Error in IRkernel::installspec() : Jupyter or IPython 3.0 has to be installed
but could neither run “jupyter” nor “ipython”, “ipython2” or “ipython3”.
(Note that “ipython2” is just IPython for Python 2, but still may be IPython 3.0)

$ R
> install.packages(c('repr', 'IRdisplay', 'evaluate', 'crayon', 'pbdZMQ', 'devtools', 'uuid', 'digest'))

This command will prompt you to type in a number to select a CRAN mirror to install the necessary packages. Enter a number and the installation will continue.

> devtools::install_github('IRkernel/IRkernel')

Then, you still need to make the R kernel visible for Jupyter:

# Install IRKernel for the current user
> IRkernel::installspec()

# Or install IRKernel system-wide
> IRkernel::installspec(user = FALSE)

Now open up the notebook application with jupyter notebook. You’ll see R appearing in the list of kernels when you create a new notebook. 

Using An R Essentials Environment In Jupyter

The second option to quickly work with R is to install the R essentials in your current environment:

conda install -c r r-essentials

These “essentials” include the packages dplyr, shiny, ggplot2, tidyr, caret, and nnet. If you don’t want to install the essentials in your current environment, you can use the following command to create a new environment just for the R essentials:

conda create -n my-r-env -c r r-essentials

Now open up the notebook application to start working with R.

You might wonder what you need to do if you want to install additional packages to elaborate your data science project. After all, these packages might be enough to get you started, but you might need other tools.

Well, you can either build a Conda R package by running, for example:

conda skeleton cran ldavis
conda build r-ldavis/

Or you can install the package from inside of R via install.packages() or devtools::install_github (to install packages from GitHub). You just have to make sure to add the new package to the correct R library used by Jupyter:

install.packages("ldavis", "/home/user/anaconda3/lib/R/library")

If you want to know more about kernels or about running R in a Docker environment, check out this page.

Adding Some R Magic To Jupyter

A huge advantage of working with notebooks is that they provide you with an interactive environment. That interactivity comes mainly from the so-called “magic commands”.

These commands allow you to switch from Python to command line instructions or to write code in another language such as R, Julia, Scala, …

To switch from Python to R, you first need to load the following extension (provided by the rpy2 package):

%load_ext rpy2.ipython

After that, you can get started with R, or you can easily switch from Python to R in your data analysis with the %R magic command.

Let’s demonstrate how the R magic works with a small example:

# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')

# Load in the r magic
%load_ext rpy2.ipython

# We need ggplot2
%R require(ggplot2)

# Load in the pandas library
import pandas as pd

# Make a pandas DataFrame
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10, 11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# Take the name of input variable df and assign it to an R variable of the same name
%%R -i df
# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))

If you want more details about Jupyter, on how to set up a notebook, where to download the application, how you can run the notebook application (via Docker, pip install or with the Anaconda distribution) or other details, check out our Definitive Guide.

The R Notebook

Up until recently, Jupyter seems to have been a popular solution for R users, next to notebooks such as Apache Zeppelin or Beaker.

Also, other alternatives to report results of data analyses, such as R Markdown, Knitr or Sweave, have been hugely popular in the R community.

However, this might change with the recent release of the R or R Markdown Notebook by RStudio.

As you can see, the context of the R Markdown Notebook is complex, and it’s worth looking into the history of reproducible research in R to understand what drove the creation and development of this notebook. Ultimately, you will also realize that this notebook is different from others.

R And The History of Reproducible Research

In his talk, J.J. Allaire confirms that the efforts in R itself for reproducible research, the efforts of Emacs to combine text, code and input, the Pandoc, Markdown and knitr projects, and computational notebooks have been evolving in parallel and influencing each other for many years. He confirms that all of these factors have eventually led to the creation and development of notebooks for R.

Firstly, computational notebooks have quite a history: since the late 80s, when Mathematica’s front end was released, there have been a lot of advancements. In 2001, Fernando Pérez started developing IPython, but it was only in 2011 that the team released the 0.12 version of IPython. The SageMath project began in 2004. After that, there have been many notebooks. The most notable ones for the data science community are Beaker (2013), Jupyter (2014) and Apache Zeppelin (2015).

Then, there are also the markup languages and text editors that have influenced the creation of RStudio’s notebook application, namely, Emacs, Markdown, and Pandoc. Org-mode was released in 2003. It’s an editing and organizing mode for notes, planning and authoring in the free software text editor Emacs. Six years later, Emacs org-R arrived to provide support for R users. Markdown, on the other hand, was released in 2004 as a markup language that allows you to format your plain text in such a way that it can be converted to HTML or other formats. Fast forward another couple of years, and Pandoc was released. It’s a writing tool and a basis for publishing workflows.

Lastly, the efforts of the R community to make sure that research can be reproducible and transparent have also contributed to the rise of a notebook for R. Sweave was introduced in 2002 to allow the embedding of R code within LaTeX documents to generate PDF files. These PDF files combined the narrative and analysis, graphics, code, and the results of computations. Ten years later, knitr was developed to solve long-standing problems in Sweave and to combine features that were present in other add-on packages into one single package. It’s a transparent engine for dynamic report generation in R. knitr allows any input language and any output markup language.

Also in 2012, R Markdown was created as a variant of Markdown that can embed R code chunks and that can be used with knitr to create reproducible web-based reports. The big advantage was, and still is, that it isn’t necessary anymore to use LaTeX, which has a steep learning curve. The syntax of R Markdown is very similar to the regular Markdown syntax but does have some tweaks to it, as you can include, for example, LaTeX equations.

R Markdown Versus Computational Notebooks

R Markdown is probably one of the most popular options in the R community to report on data analyses. It’s no surprise whatsoever that it is still a core component in the R Markdown Notebook. 

And there are some things that R Markdown and notebooks share, such as the delivering of a reproducible workflow, the weaving of code, output, and text together in a single document, supporting interactive widgets and outputting to multiple formats. However, they differ in their emphases: R Markdown focuses on reproducible batch execution, plain text representation, version control, production output and offers the same editor and tools that you use for R scripts.

On the other hand, the traditional computational notebooks focus on outputting inline with code, caching the output across sessions, sharing code and outputting in a single file. Notebooks have an emphasis on an interactive execution model. They don’t use a plain text representation, but a structured data representation, such as JSON.

That all explains the purpose of RStudio’s notebook application: it combines all the advantages of R Markdown with the good things that computational notebooks have to offer.

That’s why R Markdown is a core component of the R Markdown Notebook: RStudio defines its notebook as “an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input”.

How To Work With R Notebooks

If you’ve ever worked with Jupyter or any other computational notebook, you’ll see that the workflow is very similar. One thing that might seem very different is the fact that now you’re not working with code cells anymore by default: you’re rather working with a sort of text editor in which you indicate your code chunks with R Markdown.

How To Install And Use The R Markdown Notebook

The first requirement to use the notebook is that you have the newest version of RStudio available on your PC. Since notebooks are a new feature of RStudio, they are only available in version 1.0 or higher of RStudio. So, it’s important to check if you have a correct version installed.

If you don’t have version 1.0 or higher of RStudio, you can download the latest version here.

Then, to make a new notebook, you go to the File tab, select “New File”, and you’ll see the option to create a new R Markdown Notebook. If RStudio prompts you to update some packages, just accept the offer and eventually a new file will appear.

Tip: double-check whether you’re working with a notebook by looking at the top of your document. The output should be html_notebook.

You’ll see that the default text that appears in the document is in R Markdown. R Markdown should feel pretty familiar to you, but if you’re not yet quite proficient, you can always check out our Reporting With R Markdown course or go through the material that is provided by RStudio.

Note that you can always use the gear icon to adjust the notebook’s working space: you have the option to expand, collapse, and remove the output of your code, to change the preview options and to modify the output options.

This last option can come in handy if you want to change the syntax highlighting, apply another theme, adjust the default width and height of the figures appearing in your output, etc.

From there onwards, you can start inserting code chunks and text!

You can add code chunks in two ways: through the keyboard shortcut Ctrl + Alt + I or Cmd + Option + I or with the insert button that you find in the toolbar. 

What’s great about working with these R Markdown notebooks is the fact that you can follow up on the execution of your code chunks, thanks to the little green bar that appears on the left when you’re executing large code chunks or multiple code chunks at once. Also, note that there’s a progress bar on the bottom.


Talking about code execution: there are multiple ways in which you can execute your R code chunks.

You can run a code chunk or run the next chunk, run all code chunks below and above; but you can also choose to restart R and run all chunks or to restart and to clear the output.

Note that when you execute the notebook’s code, you will also see the output appearing on your console! That might be a rather big difference for those who usually work with other computational notebooks such as Jupyter.

If there are any errors while the notebook’s code chunks are being executed, the execution will stop, and a red bar will appear alongside the piece of code that produces the error.

You can suppress the halt of the execution by adding error=TRUE in the chunk options, just like this:

```{r, error=TRUE}
iris <- read.csv(url("http://mlr.cs.umass.edu/ml/machine-leaning-databases/"), header = FALSE)
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
```

Note that the error will still appear, but that the notebook’s code execution won’t be halted!

How To Use R Markdown Notebook’s Magic

Just like with Jupyter, you can also work interactively with your R Markdown notebooks. It works a bit differently from Jupyter, as there are no real magic commands; to work with other languages, you need to add separate Bash, Stan, Python, SQL or Rcpp chunks to the notebook.

These options might seem quite limited to you, but this is compensated by the ease with which you can add these types of code chunks with the toolbar’s insert button.

Working with these code chunks is also easy: you can see an example of SQL chunks in this document, published by J.J. Allaire. For Bash commands, you just type the command. There’s no need for extra characters such as ‘!’ to signal that you’re working in Bash, as you would when working with Jupyter.
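For illustration, chunks in other languages look like this inside an R Markdown notebook (a sketch; which engines are available depends on your setup):

```{bash}
echo "hello from Bash"
```

```{python}
print("hello from Python")
```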

How To Output Your R Markdown Notebooks 

Before you render the final version of a notebook, you might want to preview what you have been doing. There’s a handy feature that allows you to do this: you’ll find it in your toolbar. 

Click on the “preview” button and the provisional version of your document will pop up on the right-hand side, in the “Viewer” tab.

By adding some lines to the first section on top of the notebook, you can adjust your output options, like this:

---
title: "Notebook with KNN Example"
output:
  pdf_document:
    highlight: tango
    toc: yes
  html_notebook:
    toc: yes
---

If rendering to PDF requires LaTeX distributions or packages that you don’t have yet, you can just try to knit, and the console output will give you the sites where you can download the necessary packages.

Note that this is just one of the many options that you have to export a notebook: there’s also the possibility to render GitHub documents, Word documents, Beamer presentations, etc. These are the output options that you already had with regular R Markdown files. You can find more info here.

Tips And Tricks To Work With R Notebook

Besides the general coding practices that you should keep in mind, such as documenting your code and applying a consistent naming scheme, code grouping and name length, you can also use the following tips to make a notebook awesome for others to use and read:

  • Just like with computational notebooks, it might be handy to split large code chunks or code chunks that generate more than one output into multiple chunks. This way, you will improve the general user experience and increase the transparency of a notebook.
  • Make use of the keyboard shortcuts to speed up your work. You will find most of them in the toolbar, next to the commands that you want to perform. 
  • Use the spellchecker in the toolbar to make sure your report’s vocabulary is correct. 
  • Take advantage of the option to hide your code if a notebook is code-heavy. You can do this through code chunk options or in the HTML file of the notebook itself!
The R Notebook Versus The Jupyter Notebook

Besides the differences between the Jupyter and R Markdown notebooks that you have already read above, there are some more things.

Let’s compare Jupyter with the R Markdown Notebook!

There are four aspects that you will find interesting to consider: notebook sharing, code execution, version control, and project management.

Notebook Sharing

The source code for an R Markdown notebook is an .Rmd file. But when you save a notebook, an .nb.html file is created alongside it. This HTML file is an associated file that includes a copy of the R Markdown source code and the generated output. 

That means that you need no special viewer to see the file, while you might need it to view notebooks that were made with the Jupyter application, which are simple JSON documents, or other computational notebooks that have structured format outputs. You can publish your R Markdown notebook on any web server, GitHub or as an email attachment.

There also are APIs to render and parse R Markdown notebooks: this gives other frontend tools the ability to create notebook authoring modes for R Markdown. Or the APIs can be used to create conversion utilities to and from different notebook formats.

To share the notebooks you make in the Jupyter application, you can export the notebooks as slideshows, blogs, dashboards, etc. You can find more information in this tutorial. However, there are also the default options to generate Python scripts, HTML files, Markdown files, PDF files or reStructured Text files.

Code Execution

R Markdown Notebooks have options to run a single code chunk or the next chunk, and to run all code chunks below and above. In addition to these options, you can also choose to restart R and run all chunks, or to restart and clear the output.

These options are interesting when you’re working with R because the R Markdown Notebook allows all R code pieces to share the same environment. However, this can prove to be a huge disadvantage if you’re working with non-R code pieces, as these don’t share environments.

All in all, these code execution options add a considerable amount of flexibility for users who have been struggling with the code execution options that Jupyter offers, even though these are not too different: in the Jupyter application, you have the option to run a single cell, to run cells and to run all cells. You can also choose to clear the current or all outputs. The code environment is shared between code cells.

Version control

There have been claims that Jupyter messes up the version control of notebooks or that it’s hard to use git with these notebooks. Solutions to this issue are to export the notebook as a script or to set up a filter to fix parts of the metadata that shouldn’t change when you commit or to strip the run count and output.

The R Markdown notebooks seem to make this issue a bit easier to handle: since the output of your code is saved in the associated HTML file and the notebook files themselves are essentially plain text files, version control will be much easier. You can choose to only put your .Rmd file on GitHub or your other versioning system, or you can also include the .nb.html file.

Project Management

As the R Markdown Notebook is native to the RStudio development kit, the notebooks will seamlessly integrate with your R projects. Also, these notebooks support other languages, including Python, C, and SQL.

On the other hand, the Jupyter project is not native to any development kit: in that sense, it will cost some effort to integrate this notebook seamlessly with your projects. But this notebook still supports more languages and will be a more suitable companion for you if you’re looking to use Scala, Apache Toree, Julia, or another language.

Alternatives to Jupyter or R Markdown Notebooks

Apart from the notebooks that you can use as interactive data science environments which make it easy for you to share your code with colleagues, peers, and friends, there are also other alternatives to consider.

Because sometimes you don’t need a notebook, but a dashboard, an interactive learning platform or a book, for example.

You have already read about options such as Sweave and knitr in the second section. Some other options that are out there are:

  • Even though this blog post has covered R Markdown to some extent, you should know that you can do so much more with it. For example, you can build dashboards with flexdashboard.
  • Or you can use Bookdown to quickly publish HTML, PDF, ePub, and Kindle books with R Markdown. 
  • Shiny is a tool that you can also use to create dashboards. To get started with Shiny, go to this page.
  • In an educational setting, DataCamp Light might also come in handy to create interactive tutorials on your blog or website. If you want to see DataCamp Light at work, go to this tutorial, for example.


Microsoft R Open 3.3.2 now available

Wed, 2016-11-30 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Microsoft R Open 3.3.2, Microsoft's enhanced distribution of open source R, is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to version 3.3.2, adds new bundled packages and updates others, and upgrades the Intel Math Kernel Libraries.

The updated R 3.3.2 engine includes some performance improvements (particularly in calculation of eigenvalues), better handling of date axes in graphics, and improved documentation for the methods package. (See here for a complete list of fixes.) There are no user-visible changes in the language itself, which means that scripts and packages should work without changes from MRO 3.3.1.

This update to MRO comes bundled with some additional packages, notably jsonlite (for fast processing of JSON-formatted data), png (to support reading and writing of PNG images), R6 (allowing the creation of classes with reference semantics), and curl (for connecting to web-based data sources). The checkpoint and deployrRserve packages have also been updated.

The MKL libraries, which provide high-performance matrix and vector calculations to MRO, have also been updated. (This fixes some issues with matrix multiplication and arima that were reported.) 

For reproducibility, installing packages with MRO 3.3.2 will by default get you package versions as of November 1, 2016. Many packages have been updated or released since MRO 3.3.1. (See here for some highlights of new packages.) As always, if you want to use newer versions of CRAN packages (or access older versions, for reproducibility), just use the checkpoint function to access the CRAN Time Machine.
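For example, pinning a script to that snapshot date is a two-liner with the checkpoint package (a minimal sketch):

library(checkpoint)
checkpoint("2016-11-01")  # scan the project and use CRAN packages as of this date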

MRO is supported on Windows, Mac and Linux. This release adds support for two new platforms: MacOS Sierra (10.12) and Ubuntu 16.04. 

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. To download Microsoft R Open (it's free!), simply follow the link below.

MRAN: Download Microsoft R Open


Package ggguitar on CRAN

Wed, 2016-11-30 08:37

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

The initial release of ggguitar is available on CRAN! This package allows you to create guitar tablature in the style of ggplot2. As a quick example, a C Major chord can be created as follows.

library(ggguitar)
C_M <- c(NA, 3, 2, 0, 1, 0)
tablature('C Major', C_M)

The tablature function takes the name of the chord as the first argument and a six-element vector representing the six strings of the guitar. An NA indicates a string that is not played, zero represents an open string, and a number indicates a fret where a finger is to be placed on the specified string.
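For example, an open G Major chord (fingering 3-2-0-0-0-3, assuming the same string ordering as the C Major vector above) would be:

G_M <- c(3, 2, 0, 0, 0, 3)
tablature('G Major', G_M)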

The “String” and “Fret” labels and corresponding axis ticks are optional.

tablature('C Major', C_M, FALSE, FALSE)

See the package at CRAN, the vignette, or the code at Github for more information.


BERT: a newcomer in the R Excel connection

Wed, 2016-11-30 07:34

A few months ago a reader pointed me to this new way of connecting R and Excel. I don’t know how long this has been around, but I never came across it and I’ve never seen any blog post or article about it. So I decided to write a post, as the tool is really worth it and, before anyone asks, I’m not related to the company in any way.

BERT stands for Basic Excel R Toolkit. It’s free (licensed under the GPL v2)  and it has been developed by Structured Data LLC. At the time of writing the current version of BERT is 1.07. More information can be found here. From a more technical perspective, BERT is designed to support running R functions from Excel spreadsheet cells. In Excel terms, it’s for writing User-Defined Functions (UDFs) in R.

In this post I’m not going to show you how R and Excel interact via BERT. There are very good tutorials  here, here and here. Instead I want to show you how I used BERT to build a “control tower” for my trading.

How do I use BERT?

My trading signals are generated using a long list of R files but I need the flexibility of Excel to display results quickly and efficiently. As shown above, BERT can do this for me, but I also want to tailor the application to my needs. By combining the power of XML, VBA, R and BERT I can create a good looking yet powerful application in the form of an Excel file with minimum VBA code. Ultimately I have a single Excel file gathering all the necessary tasks to manage my portfolio: database update, signal generation, order submission etc. My approach can be broken down into the 3 steps below:

  1. Use XML to build user defined menus and buttons  in an Excel file.
  2. The above menus and buttons are essentially calls to VBA functions.
  3. Those VBA functions are wrappers around R functions defined using BERT.

With this approach I can keep a clear distinction between the core of my code kept in R, SQL and Python and everything used to display and format results kept in Excel, VBA & XML. In the next sections I present the prerequisites to develop such an approach and a step by step guide that explains how BERT can be used to simply pass data from R to Excel with minimal VBA code.

Prerequisite

1 – Download and install BERT from this link. Once the installation has completed you should have a new Add-Ins menu in Excel with the buttons as shown below. This is how BERT materialized in Excel.

2 – Download and install the Custom UI Editor: The Custom UI Editor allows you to create user defined menus and buttons in the Excel ribbon. A step by step procedure is available here.

Step by step guide

1 – R Code: The below R function is a very simple piece of code for illustration purposes only. It calculates and returns the residuals from a linear regression. This is what we want to retrieve in Excel. Save this in a file called myRCode.R (any other name is fine) in a directory of your choice.

myFunction <- function() {
  aa <- rnorm(200)
  bb <- rnorm(200)
  res <- lm(aa ~ bb)$res
  return(res)
}

2 – functions.R in BERT: From Excel select Add-Ins -> Home Directory and open the file called functions.R. In this file paste the following code. Make sure you insert the correct path.

source("D:\\myPath\\myRCode.R")

This is just sourcing into BERT the R file you created above. Then save and close the file functions.R. Should you want to make any change to the R file created in step 1, you will have to reload it using the BERT button “Reload Startup File” from the Add-Ins menu in Excel.

3 – In Excel: Create and save a file called myFile.xlsm (any other name is fine). This is a macro-enabled file that you save in the directory of your choice. Once the file is saved, close it.

4 – Open the file created above in Custom UI editor: Once the file is open, paste the below code.

<customUI xmlns="http://schemas.microsoft.com/office/2009/07/customui">
  <ribbon startFromScratch="false">
    <tabs>
      <tab id="RTrader" label="RTrader">
        <group id="myGroup" label="My Group">
          <button id="button1" label="New Button" size="large"
                  onAction="myRCode" imageMso="Chart3DColumnChart" />
        </group>
      </tab>
    </tabs>
  </ribbon>
</customUI>

You should have something like this in the XML editor:

Essentially this piece of XML code creates an additional menu (RTrader), a new group (My Group) and a user defined button (New Button) in the Excel ribbon. Once you’re done, open myFile.xlsm in Excel and close the Custom UI Editor. You should see something like this.

5 – Open VBA editor: In myFile.xlsm insert a new module. Paste the code below in the newly created module.

Sub myRCode(control As IRibbonControl)
    Dim a As Variant
    Dim theLength As Integer

    ThisWorkbook.Sheets("Sheet1").Range("B1:B10000").ClearContents
    a = Application.Run("BERT.Call", "myFunction")
    theLength = UBound(a, 1) + 1
    ThisWorkbook.Sheets("Sheet1").Range("B1:B" & theLength).Value = a
End Sub

This erases previous results in the worksheet prior to copying new ones.

6 – Click New Button: Now go back to the spreadsheet and in the RTrader menu click the “New Button” button. You should see something like the below appearing.

You’re done!

The guide above is a very basic version of what can be achieved using BERT, but it shows you how to combine the power of several specific tools to build your own custom application. From my perspective the interest of such an approach is the ability to glue together R and Excel, obviously, but also to include via XML (and batch) pieces of code from Python, SQL and more. This is exactly what I needed. Finally, I would be curious to know if anyone has any experience with BERT.


How to create a ggplot Theme – Unicorn Edition

Tue, 2016-11-29 19:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

Themes are a convenient way to give ggplot charts an individualized, sometimes stylish look. Most of the time, I rely on the ggthemes package and the Economist style. Last week colleagues asked me to change the look of my charts. We joked around and I agreed to create a unicorn ggplot theme. I want to use the challenge to detail a) how to create custom ggplot themes and b) look at unicorn startup data.

I took the Unicorn startup dataset, which contains startups with a valuation above 1 billion dollars.
For the theme, in the first step, I changed some basic parameters: coloring, no axis ticks (axis.ticks=element_blank()), no legend title (legend.title = element_blank()), and the same background for the legend keys as for the legend background (legend.key = element_rect(fill = "lightskyblue1", color = "lightskyblue1")). For the impatient reader: the complete code is at the end of the post.
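A minimal sketch of what such a theme function might look like; the exact element settings and colors here are assumptions, the author's full code is at the end of the original post.

library(ggplot2)
theme_unicorn <- function(base_size = 12, font = "sans") {
  theme_grey(base_size = base_size, base_family = font) +
    theme(
      axis.ticks        = element_blank(),                                             # no tick marks
      legend.title      = element_blank(),                                             # no legend title
      legend.background = element_rect(fill = "lightskyblue1", color = "lightskyblue1"),
      legend.key        = element_rect(fill = "lightskyblue1", color = "lightskyblue1")  # keys match background
    )
}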

ggplot(dd, aes(year, N, color=Industry, fill=Industry)) +
  geom_bar(stat="identity") +
  xlab("") +
  ylab("Number of Unicorns") +
  theme_unicorn(base_size = 5)

Data-wise, we see that the number of unicorn valuations massively increased in recent years. Most unicorns are active in the area of ecommerce and marketplaces.
Theme-wise, while the chart looks colorful, some parts are still missing: font style and appropriate coloring. In the chart above, ggplot defaulted to the standard color palette for the plotted elements.
In order to fix this, I took the “unicorn” color palette from here and created two functions to override the default color and fill scales. Additionally, a true unicorn chart needs a different font; even though it is pretty old, Comic Sans seems appropriate here.
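As a hedged illustration (the author's actual palette and helper functions are not shown in this excerpt), such overrides can be as simple as wrapping scale_color_manual()/scale_fill_manual() around a fixed colour vector; the hex values below are placeholders:

# hypothetical palette and scale overrides; replace with the real unicorn colours
unicorn_palette <- c("#FF9AA2", "#FFB7B2", "#FFDAC1", "#E2F0CB", "#B5EAD7", "#C7CEEA")

scale_color_discrete_unicorn <- function(...) scale_color_manual(values = unicorn_palette, ...)
scale_fill_unicorn           <- function(...) scale_fill_manual(values = unicorn_palette, ...)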

windowsFonts(F = windowsFont('Comic Sans MS'))
p <- ggplot(ddd[1:5,], aes(reorder(Industry, -m), m, color=Industry, fill=Industry)) +
  geom_point(size=4, shape="O")
p + theme_unicorn(base_size = 8, font="F") +
  scale_color_discrete_unicorn() +
  scale_fill_unicorn() +
  ylab("Mean Valuation in Bil$") +
  xlab("Industries")

So, what’s missing to complete the unicorn theme? A real unicorn mixed in with all the startups. Let’s add a PNG image of a unicorn using annotation_custom(g, xmin=30, xmax=35, ymin=10, ymax=12). The annoying part is that one needs to place the image somewhat manually on top of the chart using the x and y parameters. I also tried to place it within a bar chart with factors/categories along the x-axis, but I have not been able to figure out how to do that.

library(png)
library(grid)
## adding an image
img <- readPNG(source = "image/unicorn.png")
g <- rasterGrob(img, interpolate=TRUE)
p <- ggplot(d4, aes(N, m)) + geom_point(size=2, shape=1, color="purple")
p <- p + theme_unicorn(base_size = 10, font="F") + xlab("") + ylab("Mean Valuation in Bil$") + xlab("Number of Unicorns")
p <- p + annotation_custom(g, xmin=30, xmax=35, ymin=10, ymax=12)
p <- p + annotate(geom="text", x=3.5, y=12, label="Transportation", color="lightblue", family=F)
p <- p + annotate(geom="text", x=12, y=11.5, label="On Demand", color="lightblue")
p <- p + annotate(geom="text", x=34, y=4, label="Ecommerce", color="lightblue")
p

While ggplot has certainly improved massively in terms of style and customizability, there are some parts that do not work 100% (or that I was not able to get working properly).
Setting the font family leads to warnings/errors (even though the chart appears). Additionally, the font family does not get passed through consistently: the chart above illustrates this, as the axis labels are Comic Sans while the annotated text is not. I tried to set it directly in the annotate function, but that would not work either.

Another issue is color setting. Colors have to be defined differently depending on whether one works with a continuous or a discrete scale. Hence, they cannot be set within the theme, but need to be set individually for each chart. While that makes some sense, I assume a general custom color-setting function would be highly appreciated. A further issue encountered when setting custom colors is the number of colors that must be defined in the discrete case: if you plot many classes (e.g. over 20 industry types as in chart 1), you need to define at least as many colors. For a custom scale_color_manual(), it would be great if there were a fallback that picks n colors along the colors passed.

I hope you enjoyed the colorful unicorn. Please reach out if you have solutions to the problems mentioned above.
I hope I will find the time to create a customizable function in the form of: theme_custom(colorpalette=colors[1:5])

To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Free online course: Analyzing big data with Microsoft R Server

Tue, 2016-11-29 15:20

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you're already familiar with R, but struggling with out-of-memory or performance problems when attempting to analyze large data sets, you might want to check out this new EdX course, Analyzing Big Data with Microsoft R Server, presented by my colleague Seth Mottaghinejad. In the course, you'll learn how to build models using the RevoScaleR package, and deploy those models to production environments like Spark and SQL Server. The course is self-paced with videos, tutorials and tests, and is free to audit.

(By the way, if you don't already know R, you might want to check out the courses Introduction to R for Data Science and Programming in R for Data Science first.)

The RevoScaleR package isn't available on CRAN: it's included with Microsoft R Server and Microsoft R Client. You can download and use Microsoft R Client for free, which provides an installation of R with the RevoScaleR library built in and loaded when you start the session. An R IDE is also recommended: you can use R Tools for Visual Studio or RStudio.

The course is open now, and you can get started at EdX at the link below.

EdX: Analyzing Big Data with Microsoft R Server

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

Repeated measures ANOVA in R Exercises

Tue, 2016-11-29 12:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

One-way, two-way and n-way ANOVA are used to test differences in means when we have one, two and n factor variables. A key assumption when performing these ANOVAs is that the measurements are independent. When we have repeated measures this assumption is violated, so we have to use repeated measures ANOVA. Repeated measures designs occur often in longitudinal studies, where we are interested in understanding change over time. For example, a medical researcher might want to assess the level of depression before and after a surgical procedure. Repeated measures designs are not limited to longitudinal studies; they can also be used when there is an important variable across which you would like to repeat measurements, for example repeating your measurements at different intensity levels in a fitness experiment. Repeated measures ANOVA can be considered an extension of the paired t-test.

Before diving deeper into repeated measures ANOVA you need to understand the terminology used. A subject is a member of the sample under consideration; in the medical study introduced earlier, an individual patient is a subject. The within-subjects factor is the variable that identifies how the dependent variable has been repeatedly measured. In the medical study we would measure depression 4 weeks before surgery, 4 weeks after surgery and 8 weeks after surgery. The different conditions under which repeated measurements are made are referred to as trials. A between-subjects factor identifies independent groups in the study; for example, if we had two different procedures this would be the between-subjects factor. These conditions are referred to as groups. Repeated measures analysis requires balance in the between-subjects factor; for example, the number of subjects in each surgical procedure needs to be equal.

With a repeated measures design we are able to test the following hypotheses.

  1. There is no within-subjects main effect
  2. There is no between-subjects main effect
  3. There is no between subjects interaction effect
  4. There is no within subject by between subject interaction effect

There are two assumptions that need to be satisfied when using repeated measures.

  1. The dependent variable is normally distributed in each level of the within-subjects factor. Repeated measures analysis is robust to violations of normality with a large sample size, which is considered to be at least 30 subjects. However, the accuracy of p values is questionable when the distribution is heavily skewed or thick-tailed.
  2. The variance across the within-subjects factor is equal. This is the sphericity assumption. Repeated measures analysis is not robust to violations of this assumption, so when there is a violation power decreases and the probability of a type II error increases correspondingly. Mauchly’s test assesses the null hypothesis that the variances are equal. The sphericity assumption is only relevant when there are more than 2 levels of the within-subjects factor.

When the sphericity assumption is violated we make corrections by adjusting the degrees of freedom. The available corrections are Greenhouse-Geisser, Huynh-Feldt and lower bound. To decide on the appropriate correction we use the Greenhouse-Geisser estimate of sphericity (ε). When ε < 0.75, or we do not know anything about sphericity, Greenhouse-Geisser is the appropriate correction; when ε > 0.75, Huynh-Feldt is the appropriate correction.
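As a hedged illustration of the machinery described above (this uses simulated data and is not one of the exercise solutions), a one-way repeated measures ANOVA can be fitted in base R with aov() and an Error() term; add-on packages such as ez or afex additionally report Mauchly’s test and the Greenhouse-Geisser and Huynh-Feldt corrections.

# simulated long-format data: 30 subjects measured at three time points
set.seed(42)
sim <- data.frame(
  id    = factor(rep(1:30, each = 3)),
  time  = factor(rep(c("t1", "t2", "t3"), times = 30)),
  pulse = round(rnorm(90, mean = rep(c(90, 95, 100), times = 30), sd = 8))
)

# Error(id/time) tells aov() that time is measured repeatedly within each subject
fit <- aov(pulse ~ time + Error(id / time), data = sim)
summary(fit)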

For these exercises we will use the pulse rate data set exer. People were randomized to two diets and three exercise types, and pulse was measured at three different time points. For this data, time point is the within-subjects factor; the between-subjects factors are diet and exercise type.

The solutions to the exercises below can be found here

Exercise 1

Load the data and inspect its structure

Exercise 2

Check for missing values

Exercise 3

Check for balance in between-subjects factor

Exercise 4

Generate descriptive statistics for the sex variable which is a between subjects factor

Exercise 5

Generate descriptive statistics for the treatment level variable which is a between subjects factor

Exercise 6

Generate descriptive statistics for the weeks variable which is the within subjects factor

Exercise 7

Use histograms to assess distribution across within subjects factor.

Exercise 8

Perform a repeated measures analysis with only the within subjects factor

Exercise 9

Perform a repeated measures analysis with the within subjects factor and one between subjects factor

Exercise 10

Perform a repeated measures analysis with the within subjects factor and two between subjects factors

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Categories: Methodology Blogs

The Hitchhiker’s Guide to Ggplot2 in R

Tue, 2016-11-29 11:00

(This article was first published on Reimagined Invention, and kindly contributed to R-bloggers)

About the book

This is a technical book. The scope of the book is to go straight to the point and the writing style is similar to a recipe with detailed instructions. It is assumed that you know the basics of R and that you want to learn how to create beautiful plots.

Each chapter will explain how to create a different type of plot and will take you step by step from a basic plot to a highly customised graph. The chapters are ordered by degree of difficulty.

Every chapter is independent of the others. You can read the whole book or go straight to a section of interest; we are sure it will be easy to understand the instructions and reproduce our examples without reading the earlier chapters.

In total this book contains 218 pages (letter paper size) of different recipes to obtain an acceptable aesthetic result. You can download the book for free (yes, really!) from Leanpub.

Table of contents
  • Chapter 1: Line plots
  • Chapter 2: Area plots
  • Chapter 3: Bar plots
  • Chapter 4: Stacked bar plots
  • Chapter 5: Scatterplots
  • Chapter 6: Weighted scatterplots
  • Chapter 7: Histograms
  • Chapter 8: Density plots
  • Chapter 9: Function plots
  • Chapter 10: Boxplots
  • Chapter 11: Linear regression plots
  • Chapter 12: LOWESS plots

We also included a pack that contains the Rmd files that we used to generate every chart that is displayed in the book.

How did the book start?

A few months ago I finished writing the eleventh tutorial in a series on using ggplot2 that I created with Jodie Burchell. Jodie holds a PhD, now works in industry, and loves data. I highly recommend reading her blog Standard Error, where you can find really good material on reproducible research, among other things.

A few weeks later those tutorials evolved into an ebook. The reason was that what we started to write was unexpectedly successful, and we even had retweets from important people in the R community such as Hadley Wickham. Finally, the book was released on Leanpub.

As for me, I asked Jodie to co-write a blog entry when I found her blog and realised that my interest in Data Science and R Programming was reflected there.

Why Leanpub?

Leanpub is a platform where you can easily write your book using MS Word, among other writing software, and it even has GitHub and Dropbox integration. We went for R Markdown with LaTeX output, which shows that Leanpub is both easy to use and flexible.

What’s more, Leanpub lets the audience download your books for free, if you allow it, or you can define a price range with a suggested price. The website gives authors a royalty of 90% minus 50 cents per sale (compared to other platforms this is favourable for authors). You can also sell your books with additional exercises, video lessons, etc.

Just two weeks ago I updated all the examples in the book, a few days after ggplot2 2.2.0 was released, and my readers received a notification email as soon as I uploaded the new version. Readers can download newer versions for free, whether or not they paid for the book.

What did I learn from my first book?

At the moment I am teaching Data Visualization, and from my students I learned that good visualizations come after they learn the visualization concepts. Coding clearly helps, but coding comes after the fundamentals. It would be better to teach the fundamentals first rather than in parallel with coding, and this applies especially when your audience has never written code before.

The interested reader may find some remarkable books that can be read before mine. I highly recommend:

Those are really good books that show the fundamentals of Data Visualisation and provide the key concepts and rules needed to communicate effectively with data.

To leave a comment for the author, please follow the link and comment on their blog: Reimagined Invention.

Categories: Methodology Blogs

DataChats: An interview with Garrett Grolemund

Tue, 2016-11-29 10:17

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hey R fans! We’ve just released a new episode of DataCamp‘s DataChats video series. Our first episode with Max Kuhn has already been watched over 700 times and this new episode promises to be a blockbuster as well. 

In this second episode we interview Garrett Grolemund. Garrett is a Data Scientist at RStudio and the author of Hands-On Programming with R and R for Data Science from O’Reilly Media. He talks us through how he discovered R, the evolution of R, what data science means to him and much more. 

We hope that you enjoy watching this series and make sure not to miss any of our upcoming fantastic episodes by subscribing to DataCamp’s youtube channel!

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

Categories: Methodology Blogs

A new `subprocess` package for R

Tue, 2016-11-29 03:26

Here’s a new package that brings to R a new API for handling child processes – similar to how Python handles them.

Unlike the already available system() and system2() calls from the base package or the mclapply() function from the parallel package, this new API is aimed at handling long-lived child processes that can be controlled by the parent R process in a programmatic way. Here’s an example:

handle <- spawn_process("/usr/bin/sshpass", c("ssh", "-T", "user@domain.com"))
process_write(handle, "password")
process_write(handle, "ls\n")
process_read(handle, "stdout")
#> "bin" "public_html" "www-backup"

This of course can be done with system("ssh", c("user@domain.com", "ls")) as well (at least as long as password-less ssh connectivity is enabled). However, if there is a need to make a number of subsequent calls in response to the user’s input, keeping a single connection open can save some time. Otherwise you need to wait for ssh to establish a new connection each time a new command is to be executed.

Perhaps a slightly sillier example is working with a local (or remote, for that matter) Spark session. Imagine there is no package dedicated to Spark (which might well be the case with the next new thing that you find under your Christmas tree this year). The simplest approach could be to open the Spark console and keep it alive while sending commands to its standard input and parsing the text output. However naive, this approach can save some prototyping time.

handle <- spawn_process("/usr/bin/spark-shell")
process_write(handle, 'val textFile = sc.textFile("README.md")\n')
process_write(handle, 'textFile.count()\n')
process_read(handle)
[1] "textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25"
[2] "res0: Long = 126"

The new subprocess package is available from my GitHub account and CRAN. All functions can be run in both Linux and Windows and the few OS-specific details (like signals) are described in respective manual pages. There is also an introductory vignette.

I should also say that this package has been designed with Python’s subprocess module in mind, which (both the package and the language) I greatly admire. Its R equivalent is now at version 0.7.4, which is there to indicate that it is not yet a perfect equivalent. More features (such as simultaneous waiting on stdout and stderr) are still to come.

Categories: Methodology Blogs

Working at RStudio

Mon, 2016-11-28 19:57

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

I’ve joined Hadley’s team at RStudio.

Unsurprisingly, I’ll be working on some modeling related R packages and infrastructure. It is very exciting and I’m looking forward to learning a lot and creating some cool things.

I’ve had a great time doing drug discovery at Pfizer for the last 12ish years and I’ll miss working with everyone there.

My contact info has been changed on my existing R packages, and you can find me via my personal email (mxkuhn@gmail.com), my RStudio address (max@rstudio.com), or my GitHub page.

To leave a comment for the author, please follow the link and comment on their blog: Blog - Applied Predictive Modeling.

Categories: Methodology Blogs

Seasonal Analysis in EGRET

Mon, 2016-11-28 19:00

(This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

Introduction

This document describes how to obtain seasonal information from the R package EGRET – Exploration and Graphics for RivEr Trends: An R-package for the analysis of long-term changes in water quality and streamflow, including the water-quality method Weighted Regressions on Time, Discharge, and Season (WRTDS)

For example, we might want to know the fraction of the load that takes place in the winter season (say that is December, January, and February). We can look at the seasonal information for a single year, or averages over several years, or in terms of flow normalized fluxes.

Getting started

First, you need to have installed and loaded the EGRET package. Then, you’ll need to create an eList object. See the EGRET vignette or user guide here for more information on how to use your own water quality data in EGRET. Once the eList object has been created, run the modelEstimation function to create a WRTDS model.

For this post, we will use the Choptank River as an example. There is an example data set included in EGRET with measured nitrate in the Choptank River. Here is an example of how to load that example data:

library(EGRET)
eList <- Choptank_eList

To start with, let’s look at the results calculated for complete water years.

tableResults(eList)
##
##    Choptank River
##    Inorganic nitrogen (nitrate and nitrite)
##    Water Year
##
##    Year  Discharge    Conc  FN_Conc     Flux  FN_Flux
##               cms      mg/L          10^6 kg/yr
##
##    1980      4.25    0.949    1.003   0.1154    0.106
##    1981      2.22    1.035    0.999   0.0675    0.108
##    1982      3.05    1.036    0.993   0.0985    0.110
##    1983      4.99    1.007    0.993   0.1329    0.112
....
##    2006      3.59    1.382    1.362   0.1409    0.147
##    2007      4.28    1.408    1.382   0.1593    0.149
##    2008      2.56    1.477    1.401   0.1008    0.149
##    2009      3.68    1.409    1.419   0.1328    0.149
##    2010      7.19    1.323    1.438   0.2236    0.149
##    2011      5.24    1.438    1.457   0.1554    0.148

Looking at the last column of these results we see that, for example, the flow normalized flux in water year 2010 is estimated to be 0.149 10^6 kg/year. Now, let’s say we had a particular interest in the winter season, which we define here as the months of December, January, and February. Note: some lines (1984-2005) were removed from this blog post simply to save space.

The next step is to establish what season you are interested in looking at. All functions in EGRET can be done on the “water year” (Oct-Sept), the calendar year (Jan-Dec), or any set of sequential months. To define what period of analysis (PA) to use, there is a function setPA. The setPA function has two arguments:

  • paStart is the number of the calendar month that is the start of the season.
  • paLong is the length of the season in months (it can be any number from 1 to 12).

For example, if we want to consider the winter, defined here as December through February, the starting month (paStart) would be 12 and the length (paLong) would be 3:

eList <- setPA(eList, paStart = 12, paLong = 3)

Now let’s view our results when we focus on the winter season.

tableResults(eList)
##
##    Choptank River
##    Inorganic nitrogen (nitrate and nitrite)
##    Season Consisting of Dec Jan Feb
##
##    Year  Discharge    Conc  FN_Conc     Flux  FN_Flux
##               cms      mg/L          10^6 kg/yr
##
##    1980     4.220     1.10     1.11   0.1403    0.156
##    1981     1.960     1.20     1.13   0.0735    0.161
##    1982     5.057     1.16     1.15   0.1764    0.165
##    1983     3.504     1.21     1.16   0.1310    0.169
....
##    2006     5.843     1.51     1.51   0.2675    0.214
##    2007     5.417     1.55     1.53   0.2419    0.216
##    2008     2.436     1.62     1.55   0.1217    0.217
##    2009     2.711     1.66     1.56   0.1396    0.216
##    2010    13.779     1.26     1.57   0.4161    0.215
##    2011     3.369     1.66     1.57   0.1669    0.215

Note that now the estimated flow normalized flux is 0.215 10^6 kg/year. Let’s take a moment to think about that in relation to the previous result. What it is saying is that the flux (mass per unit time) is greater during this three-month period than the average flux for the entire water year. That makes sense: some seasons will have higher than average fluxes and other seasons will have lower than average fluxes.

Now, it might be that we want to think about these results not in terms of a flux but in terms of mass alone. In the first result (water year) the mass for the year is 0.149 10^6 kg. But for the winter season we would compute the mass as 0.215 * 90 / 365, which gives a result of 0.053 10^6 kg. To get this result we took the annual rate (0.215), divided by the number of days in the year (365) to get a rate in mass per day, and then multiplied by the number of days in the season (90), which is the sum of the lengths of those three months (31 + 31 + 28), to get a total mass.
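The same conversion written out as plain R arithmetic, using the values from the tables above:

winter_rate <- 0.215           # flow normalized winter flux, 10^6 kg/yr
winter_days <- 31 + 31 + 28    # days in Dec, Jan, Feb
winter_rate * winter_days / 365
# [1] 0.0530137                # i.e. about 0.053 10^6 kg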

We have taken pains here to run through this calculation because users have sometimes been confused, wondering how the flux for a part of the year can be greater than the flux for the whole year. Depending on the ultimate objective of the analysis one might want to present the seasonal results in either of these ways (mass or mass per unit time).

Next we will look at descriptions of change.

Seasonal Changes

Let’s use the tableChange function to explore the change from 1990 to 2010. We will do it first for the full water year and then for the winter season.

eList <- setPA(eList, paStart = 10, paLong = 12)
tableChange(eList, yearPoints = c(1990,2010))
##
##    Choptank River
##    Inorganic nitrogen (nitrate and nitrite)
##    Water Year
##
##    Concentration trends
##    time span        change     slope    change    slope
##                       mg/L    mg/L/yr       %       %/yr
##
##    1990 to 2010       0.31     0.016       28       1.4
##
##
##    Flux Trends
##    time span          change        slope        change    slope
##                    10^6 kg/yr   10^6 kg/yr /yr       %       %/yr
##    1990 to 2010        0.02       0.00099          15        0.76

eList <- setPA(eList, paStart = 12, paLong = 3)
tableChange(eList, yearPoints = c(1990,2010))
##
##    Choptank River
##    Inorganic nitrogen (nitrate and nitrite)
##    Season Consisting of Dec Jan Feb
##
##    Concentration trends
##    time span        change     slope    change    slope
##                       mg/L    mg/L/yr       %       %/yr
##
##    1990 to 2010       0.24     0.012       18       0.89
##
##
##    Flux Trends
##    time span          change        slope        change    slope
##                    10^6 kg/yr   10^6 kg/yr /yr       %       %/yr
##    1990 to 2010        0.02       0.001           10        0.51

What we see is a fairly large difference between the concentration trends for the winter as compared to the whole year (concentration rose more steeply for the winter than it did for the year on average). But what we are going to focus on here is the trend in flux. The first thing to note is that the change from 1990 to 2010 is identical for the winter season and the year as a whole. That is, the change in the flow normalized flux for the full water year is +0.020 10^6 kg/year (it went from 0.129 to 0.149) and, when we look at the winter season, the change is also +0.020 10^6 kg/yr (from 0.195 to 0.215). So the change for this season over the 20-year period is essentially the same as the change for the entire water year. In other words, although the change in flux (mass per unit time) for the winter is the same as for the full year, the change in mass for the winter season is about 25% of the change in mass for the full year (because the winter accounts for about 25% of the days in the year). Thus, we can conclude that the winter change is not atypical of the changes for the other parts of the year.

The results shown above also express the change as a slope (either 0.001 or 0.00099, virtually identical to each other); these are simply the change results divided by the number of years. The next entry in the tableChange output is the change expressed as a percentage. Here we see a big difference between the winter and the whole year: the whole year shows an increase of +15% over the 20 years, while the winter season shows an increase of 10%. That is because the same amount of increase, 0.02 10^6 kg/year, is compared to the smaller number (the 1990 flow normalized annual flux of 0.129 10^6 kg/year) in the first table and to the larger number (the 1990 seasonal flux of 0.195 10^6 kg/year) in the second table. So, even though the change is spread equally across the winter and non-winter months, the percentage change for the winter is smaller than the percentage change for the whole year.
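The percentage figures follow directly from the numbers above:

change <- 0.02                 # change in flow normalized flux, 10^6 kg/yr
100 * change / 0.129           # relative to the 1990 annual flux: about 15%
100 * change / 0.195           # relative to the 1990 winter flux: about 10%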

Seasonal Load Fraction

Next, we can think about the seasonal load fraction.

You will need to read in two new functions, called setupSeasons and setupYearsPlus, designed for this purpose. You can copy them from here and paste them into your workspace (all as a single copy and paste), or you can put them into an .R file that you source each time you want to use them. The functions use the dplyr package, which is useful for general data exploration and manipulation.

library(dplyr)

setupSeasons <- function(eList){
  Daily <- eList$Daily

  SeasonResults <- setupYearsPlus(Daily,
                                  paLong = eList$INFO$paLong,
                                  paStart = eList$INFO$paStart)
  AnnualResults <- setupYearsPlus(Daily,
                                  paLong = 12,
                                  paStart = eList$INFO$paStart) %>%
    filter(Counts >= 365) # To make sure we are getting full years

  divideBy <- 1000000

  annualPctResults <- AnnualResults %>%
    mutate(FluxYear = Flux*Counts/divideBy,
           FNFluxYear = FNFlux*Counts/divideBy) %>%
    select(FluxYear, FNFluxYear, Year)

  seasonPctResults <- SeasonResults %>%
    mutate(FluxSeason = Flux*Counts/divideBy,
           FNFluxSeason = FNFlux*Counts/divideBy) %>%
    left_join(annualPctResults, by = "Year") %>%
    mutate(pctFlux = 100*FluxSeason/FluxYear,
           pctFNFlux = 100*FNFluxSeason/FNFluxYear) %>%
    select(-Q, -Conc, -Flux, -FNFlux, -FNConc, -Counts) %>%
    rename(seasonStart = paStart,
           seasonLong = paLong)

  return(seasonPctResults)
}

setupYearsPlus <- function (Daily, paLong = 12, paStart = 10){
  monthsToUse <- seq(paStart, length = paLong)
  monthsToUse[monthsToUse > 12] <- monthsToUse[monthsToUse > 12] - 12

  crossesYear <- paLong + (paStart - 1) > 12

  AnnualResults <- Daily %>%
    mutate(Year = as.integer(format(Date, "%Y"))) %>%
    filter(Month %in% monthsToUse) %>%
    mutate(Year = if(crossesYear){
      ifelse(Month >= paStart, Year + 1, Year)
    } else {
      Year
    }) %>%
    group_by(Year) %>%
    summarise(DecYear = mean(DecYear, na.rm = TRUE),
              Q = mean(Q, na.rm = TRUE),
              Conc = mean(ConcDay, na.rm = TRUE),
              Flux = mean(FluxDay, na.rm = TRUE),
              FNConc = mean(FNConc, na.rm = TRUE),
              FNFlux = mean(FNFlux, na.rm = TRUE),
              Counts = sum(!is.na(ConcDay))) %>%
    mutate(paLong = paLong,
           paStart = paStart)

  return(AnnualResults)
}

Simply use the loaded eList to calculate these seasonal load fractions. Let’s go back to the winter season (Dec-Feb):

eList <- setPA(eList, paStart = 12, paLong = 3)
seasonPctResults <- setupSeasons(eList)

Looking at your results

What you now have is a data frame called seasonPctResults. The columns it contains are the following:

variable       Definition
DecYear        Decimal year of the mid-date of the season
Year           Calendar year of the mid-date of the year
FluxYear       Estimated flux for the year in millions of kg
FNFluxYear     Flow normalized flux for the year in millions of kg
FluxSeason     Estimated flux for the season in millions of kg
FNFluxSeason   Flow normalized flux for the season in millions of kg
pctFlux        Seasonal flux as a percentage of annual flux
pctFNFlux      Flow normalized seasonal flux as a percentage of flow normalized annual flux
seasonLong     Length of the season in months
seasonStart    Starting month of the season, 1 = January

Plotting the time series

We can make a graph showing the percentage flux (estimated annual and flow normalized). Note that this workflow uses base R plotting functions. You could also use the EGRET function genericEGRETDotPlot to automatically pick plotting styles that are consistent with other EGRET plots.

plotTitle = paste("Seasonal Flux as a Percent of Annual Flux\n",
                  eList$INFO$shortName, eList$INFO$paramShortName,
                  "\nSolid line is percentage of flow normalized flux")
par(mar = c(5,6,4,2) + 0.1, mgp = c(3,1,0))
plot(seasonPctResults$DecYear, seasonPctResults$pctFlux, pch = 20,
     yaxs = "i", ylim = c(0,100), las = 1, tck = .01,
     xlab = "Year", ylab = "Percentage of Annual Flux",
     main = plotTitle, cex = 1.5)
lines(seasonPctResults$DecYear, seasonPctResults$pctFNFlux, col = "green", lwd = 2)
axis(3, labels = FALSE, tck = .01)
axis(4, labels = FALSE, tck = .01)

We can interpret this example graph as follows. The winter flux of nitrate fluctuates a good deal from year to year, from a low of around 10% to a high of around 60%, but the mean percentage hasn’t changed much over the years; it is around 35% of the annual total flux.

Computing averages over a period of years

Let’s say we wanted to answer the question: what percentage of the annual total flux moved in the winter season during the years 2000 through 2010? We can answer that question with a simple set of calculations.

Keep in mind that we define the “year” by the calendar year in which the period of analysis ends. So, for this analysis, the full 2010 “year” runs from December 2009 through the end of November 2010.

  • Filter the data frame seasonPctResults for the years 2000 – 2010.

  • Now we can compute the sum of the annual fluxes for those years and the sum of the seasonal fluxes for those years, and then get our answer by taking the ratio and multiplying by 100.

years00_10 <- filter(seasonPctResults, Year >= 2000, Year <= 2010)
sumYears <- sum(years00_10$FluxYear)
sumSeasons <- sum(years00_10$FluxSeason)
avePct <- 100 * sumSeasons / sumYears

The total flux for all years in the period of interest in millions of kg is sumYears = 1.7297595.

The total seasonal flux for all years of the period of interest in millions of kg is sumSeasons = 0.6099614.

The percentage of the total flux for the years 2000 through 2010 that was transported in the winter months is avePct = 35.2627852.

This can be determined for any set of years simply by changing the two year values in the filter() call to the first and last years of interest.

Questions
  • Robert M. Hirsch

  • Laura DeCicco

Information on USGS-R packages used in this post:

EGRET: Exploration and Graphics for RivEr Trends: An R-package for the analysis of long-term changes in water quality and streamflow, including the water-quality method Weighted Regressions on Time, Discharge, and Season (WRTDS)

dataRetrieval: This R package is designed to obtain USGS or EPA water quality sample data, streamflow data, and metadata directly from web services.

To leave a comment for the author, please follow the link and comment on their blog: The USGS OWI blog.

Categories: Methodology Blogs

elasticsearchr – a Lightweight Elasticsearch Client for R

Mon, 2016-11-28 18:00

(This article was first published on R – When Localhost Isn't Enough, and kindly contributed to R-bloggers)

elasticsearchr: a Lightweight Elasticsearch Client for R

Elasticsearch is a distributed NoSQL document store search-engine and column-oriented database, whose fast (near real-time) reads and powerful aggregation engine make it an excellent choice as an ‘analytics database’ for R&D, production-use or both. Installation is simple, it ships with sensible default settings that allow it to work effectively out-of-the-box, and all interaction is made via a set of intuitive and extremely well documented RESTful APIs. I’ve been using it for two years now and I am evangelical.

The elasticsearchr package implements a simple Domain-Specific Language (DSL) for indexing, deleting, querying, sorting and aggregating data in Elasticsearch, from within R. The main purpose of this package is to remove the labour involved with assembling HTTP requests to Elasticsearch’s REST APIs and processing the responses. Instead, users of this package need only send and receive data frames to Elasticsearch resources. Users needing richer functionality are encouraged to investigate the excellent elastic package from the good people at rOpenSci.

This package is available on CRAN or from this GitHub repository. To install the latest development version from GitHub, make sure that you have the devtools package installed (this comes bundled with RStudio), and then execute the following on the R command line:

devtools::install_github("alexioannides/elasticsearchr")

Installing Elasticsearch

Elasticsearch can be downloaded here, where the instructions for installing and starting it can also be found. OS X users (such as myself) can also make use of Homebrew to install it with the command,

$ brew install elasticsearch

And then start it by executing $ elasticsearch from within any Terminal window. Successful installation and start-up can be checked by navigating any web browser to http://localhost:9200, where the following message should greet you (give or take the node name, which changes with every restart),

{
  "name" : "Kraven the Hunter",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.3.5",
    "build_hash" : "90f439ff60a3c0f497f91663701e64ccd01edbb4",
    "build_timestamp" : "2016-07-27T10:36:52Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

Elasticsearch 101

If you followed the installation steps above, you have just installed a single Elasticsearch ‘node’. When not testing on your laptop, Elasticsearch usually comes in clusters of nodes (usually there are at least 3). The easiest way to get access to a managed Elasticsearch cluster is by using the Elastic Cloud managed service provided by Elastic (note that Amazon Web Services offer something similar too). For the rest of this brief tutorial I will assume you’re running a single node on your laptop (a great way of working with data that is too big for memory).

In Elasticsearch a ‘row’ of data is stored as a ‘document’. A document is a JSON object – for example, the first row of R’s iris dataset,

#   sepal_length sepal_width petal_length petal_width species
# 1          5.1         3.5          1.4         0.2  setosa

would be represented as follows using JSON,

{
  "sepal_length": 5.1,
  "sepal_width": 3.5,
  "petal_length": 1.4,
  "petal_width": 0.2,
  "species": "setosa"
}

Documents are classified into ‘types’ and stored in an ‘index’. In a crude analogy with traditional SQL databases that is often used, we would associate an index with a database instance and the document types as tables within that database. In practice this example is not accurate – it is better to think of all documents as residing in a single – possibly sparse – table (defined by the index), where the document types represent non-unique sub-sets of columns in the table. This is especially so as fields that occur in multiple document types (within the same index), must have the same data-type – for example, if "name" exists in document type customer as well as in document type address, then "name" will need to be a string in both.

Each document is considered a ‘resource’ that has a Uniform Resource Locator (URL) associated with it. Elasticsearch URLs all have the following format: http://your_cluster:9200/your_index/your_doc_type/your_doc_id. For example, the above iris document could be living at http://localhost:9200/iris/data/1 – you could even point a web browser to this location and investigate the document’s contents.

Although Elasticsearch – like most NoSQL databases – is often referred to as being ‘schema free’, as we have already seen this is not entirely correct. What is true, however, is that the schema – or ‘mapping’ as it’s called in Elasticsearch – does not need to be declared up-front (although you certainly can do this). Elasticsearch is more than capable of guessing the types of fields based on new data indexed for the first time.

For more information on any of these basic concepts take a look here

elasticsearchr: a Quick Start

elasticsearchr is a lightweight client – by this I mean that it only aims to do ‘just enough’ work to make using Elasticsearch with R easy and intuitive. You will still need to read the Elasticsearch documentation to understand how to compose queries and aggregations. What follows is a quick summary of what is possible.

Elasticsearch Data Resources

Elasticsearch resources, as defined by the URLs described above, are defined as elastic objects in elasticsearchr. For example,

es <- elastic("http://localhost:9200", "iris", "data")

This refers to documents of type ‘data’ in the ‘iris’ index located on an Elasticsearch node on my laptop. Note that:
– it is possible to leave the document type empty if you need to refer to all documents in an index; and,
– elastic objects can be defined even if the underlying resources have yet to be brought into existence.

Indexing New Data

To index (insert) data from a data frame, use the %index% operator as follows:

elastic("http://localhost:9200", "iris", "data") %index% iris

In this example, the iris dataset is indexed into the ‘iris’ index and given a document type called ‘data’. Note that I have not provided any document ids here. To explicitly specify document ids there must be a column in the data frame that is labelled id, from which the document ids will be taken.
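As a small sketch (not from the original post) of how explicit ids can be supplied, assuming the same local node is running:

library(elasticsearchr)

# rename iris columns to match the earlier document example and add an id column
iris_with_ids <- iris
colnames(iris_with_ids) <- c("sepal_length", "sepal_width",
                             "petal_length", "petal_width", "species")
iris_with_ids$id <- seq_len(nrow(iris_with_ids))

# document ids are taken from the column literally named "id"
elastic("http://localhost:9200", "iris", "data") %index% iris_with_ids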

Deleting Data

Documents can be deleted in three different ways using the %delete% operator. Firstly, an entire index (including the mapping information) can be erased by referencing just the index in the resource – e.g.,

elastic("http://localhost:9200", "iris") %delete% TRUE

Alternatively, documents can be deleted on a type-by-type basis, leaving the index and its mappings untouched, by referencing both the index and the document type as the resource – e.g.,

elastic("http://localhost:9200", "iris", "data") %delete% TRUE

Finally, specific documents can be deleted by referencing their ids directly – e.g.,

elastic("http://localhost:9200", "iris", "data") %delete% c("1", "2", "3", "4", "5")

Queries

Any type of query that Elasticsearch makes available can be defined in a query object using the native Elasticsearch JSON syntax – e.g. to match every document we could use the match_all query,

for_everything <- query('{ "match_all": {} }')

To execute this query we use the %search% operator on the appropriate resource – e.g.,

elastic("http://localhost:9200", "iris", "data") %search% for_everything
#   sepal_length sepal_width petal_length petal_width species
# 1          4.9         3.0          1.4         0.2  setosa
# 2          4.9         3.1          1.5         0.1  setosa
# 3          5.8         4.0          1.2         0.2  setosa
# 4          5.4         3.9          1.3         0.4  setosa
# 5          5.1         3.5          1.4         0.3  setosa
# 6          5.4         3.4          1.7         0.2  setosa
# ...

Sorting Query Results

Query results can be sorted on multiple fields by defining a sort object using the same Elasticsearch JSON syntax – e.g. to sort by sepal_width in ascending order the required sort object would be defined as,

by_sepal_width <- sort('{"sepal_width": {"order": "asc"}}')

This is then added to a query object whose results we want sorted and executed using the %search% operator as before – e.g.,

elastic("http://localhost:9200", "iris", "data") %search% (for_everything + by_sepal_width)
#   sepal_length sepal_width petal_length petal_width    species
# 1          5.0         2.0          3.5         1.0 versicolor
# 2          6.0         2.2          5.0         1.5  virginica
# 3          6.0         2.2          4.0         1.0 versicolor
# 4          6.2         2.2          4.5         1.5 versicolor
# 5          4.5         2.3          1.3         0.3     setosa
# 6          6.3         2.3          4.4         1.3 versicolor
# ...

Aggregations

Similarly, any type of aggregation that Elasticsearch makes available can be defined in an aggs object – e.g. to compute the average sepal_width per-species of flower we would specify the following aggregation,

avg_sepal_width <- aggs('{
  "avg_sepal_width_per_species": {
    "terms": {
      "field": "species",
      "size": 3
    },
    "aggs": {
      "avg_sepal_width": {
        "avg": {
          "field": "sepal_width"
        }
      }
    }
  }
}')

(Elasticsearch 5.x users please note that when using the out-of-the-box mappings the above aggregation requires that "field": "species" be changed to "field": "species.keyword" – see here for more information as to why)

This aggregation is also executed via the %search% operator on the appropriate resource – e.g.,

elastic("http://localhost:9200", "iris", "data") %search% avg_sepal_width
#          key doc_count avg_sepal_width.value
# 1     setosa        50                 3.428
# 2 versicolor        50                 2.770
# 3  virginica        50                 2.974

Queries and aggregations can be combined such that the aggregations are computed on the results of the query. For example, to execute the combination of the above query and aggregation, we would execute,

elastic("http://localhost:9200", "iris", "data") %search% (for_everything + avg_sepal_width)
#          key doc_count avg_sepal_width.value
# 1     setosa        50                 3.428
# 2 versicolor        50                 2.770
# 3  virginica        50                 2.974

where the combination yields,

print(for_everything + avg_sepal_width)
# {
#     "size": 0,
#     "query": {
#         "match_all": {
#
#         }
#     },
#     "aggs": {
#         "avg_sepal_width_per_species": {
#             "terms": {
#                 "field": "species",
#                 "size": 0
#             },
#             "aggs": {
#                 "avg_sepal_width": {
#                     "avg": {
#                         "field": "sepal_width"
#                     }
#                 }
#             }
#         }
#     }
# }

For comprehensive coverage of all query and aggregations types please refer to the rather excellent official documentation (newcomers to Elasticsearch are advised to start with the ‘Query String’ query).

Mappings

Finally, I have included the ability to create an empty index with a custom mapping, using the %create% operator – e.g.,

elastic("http://localhost:9200", "iris") %create% mapping_default_simple()

Here, mapping_default_simple() is a default mapping that I have shipped with elasticsearchr. It switches off the text analyser for all fields of type ‘string’ (i.e. switches off free text search), allows all text search to work with case-insensitive lower-case terms, and maps any field with the name ‘timestamp’ to type ‘date’, so long as it has the appropriate string or long format.

Forthcoming Attractions

I do not have a grand vision for elasticsearchr – I want to keep it a lightweight client that requires knowledge of Elasticsearch – but I would like to add the ability to compose major query and aggregation types, without having to type-out lots of JSON, and to be able to retrieve simple information like the names of all indices in a cluster, and all the document types within an index, etc. Future development will likely be focused in these areas, but I am open to your suggestions (open an issue here).

Acknowledgements

A big thank you to Hadley Wickham and Jeroen Ooms, the authors of the httr and jsonlite packages that elasticsearchr leans upon heavily.

To leave a comment for the author, please follow the link and comment on their blog: R – When Localhost Isn't Enough.

Categories: Methodology Blogs

Bill and Ted Make the best out of a Shi… Stata situation: Rstudio + Rstata + Stata

Mon, 2016-11-28 12:43

(This article was first published on rstats – MikeJackTzen, and kindly contributed to R-bloggers)

Rewatching the Thanksgiving classic Bill and Ted’s Excellent Adventure reminded me of the history of #Rstats and its current status as the de facto software for general data programming.

The most excellent thing about R is the literate programming options you have. As a data analyst, you are Bill S. Preston Esquire (or Ted “Theodore” Logan; they are interchangeable). RStudio is the time-traveling phone booth. Since its conception, Rstats has had Sweave’s phone number on speed dial. Now RStudio has R Markdown. Compare this situation with… Stata. Stata is Genghis Khan.

Seeing Mine Çetinkaya-Rundel post about the joys of Stata,

During these discussions a package called RStata also came up. This package is [a] simple R -> Stata interface allowing the user to execute Stata commands (both inline and from a .do file) from R.” Looks promising as it should allow running Stata commands from an R Markdown chunk. But it’s really not realistic to think students learning Stata for the first time will learn well (and easily) using this R interface. I can’t imagine teaching Stata and saying to students “first download R”. Not that I teach Stata, but those who do confirmed that it would be an odd experience for students…

I decided to see for myself how (un)approachable writing narratives for literate programming in Stata really is.

If Plato pitched his ideal to So-crates, he would claim:

Integrating Rstudio + Rmarkdown + R + RStata, should give you the best of 3 worlds

1) Write narratives that are human-readable

2) Manipulate data with human-readable R code

3) Have ‘paid-for-assurance’ of Stata analysis commands

But!  Bill and Ted would probably get bogged down during the setup. The key overhead step is to make sure Bill’s RStata package plays nicely with his local copy of Stata.

This is like chaperoning Genghis Khan in a shopping mall by letting him run loose without an adult-sized child leash. He might be enjoying a delicious Cinnabon all by his lonesome, or he might be playing home run derby with a mannequin’s head.

It depends on Genghis’s mood, aka the disgruntled software versions in his computing environment.

The setup overhead is definitely an obstacle to adoption. You also need to pin the RStudio version (it is undergoing rapid development) for its notebook feature, and you need to align the Stata version (with its yearly business-as-usual updates).
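For what it is worth, the configuration step looks roughly like the sketch below. This is a hedged example based on the RStata package's conventions; the Stata path and version are placeholders that have to be adjusted for your own installation.

library(RStata)

# point RStata at the local Stata binary (placeholder path and version)
options("RStata.StataPath" = "\"C:\\Program Files (x86)\\Stata14\\StataSE-64\"")
options("RStata.StataVersion" = 14)

# once configured, Stata commands can be run from an R (or R Markdown) chunk
stata("display 2 + 2")
# or a whole .do file can be sourced, e.g. stata("myDoFile.do")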

I can only see this being useful if Ted is a Stata user with a near-‘final’ Stata .do file that he wants to share in a reproducible manner. During his presentation to his high school history class, Ted would narrate his analysis center stage via markdown, and whenever a result needs to be piped in, he could just chunk-source the .do file in RStudio (like pulling Genghis Khan out of the phone booth). Most Excellent.

The gist below is my standalone R Notebook demo, which should work if you have RStudio 1.0.44 and Stata 14. Your Mileage May Vary, with or without a time-traveling phone booth.

https://gist.github.com/statsccpr/5f4bb658c15ff2a31b3ba0c0afae228d#file-rstata_boomerang_rnotebook-rmd

 

To leave a comment for the author, please follow the link and comment on their blog: rstats – MikeJackTzen.

Categories: Methodology Blogs

Data frame exercises Vol. 2

Mon, 2016-11-28 12:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

[In the exercises below we cover the basics of data frames.]

Answers to the exercises are available here.

Exercise 1

Consider two vectors:
x=seq(1,43,along.with=Id)
y=seq(-20,0,along.with=Id)
Create a data.frame df:

>df
Id Letter x y
1 1 a 1.000000 -20.000000
2 1 b 4.818182 -18.181818
3 1 c 8.636364 -16.363636
4 2 a 12.454545 -14.545455
5 2 b 16.272727 -12.727273
6 2 c 20.090909 -10.909091
7 3 a 23.909091 -9.090909
8 3 b 27.727273 -7.272727
9 3 c 31.545455 -5.454545
10 4 a 35.363636 -3.636364
11 4 b 39.181818 -1.818182
12 4 c 43.000000 0.000000

Exercise 2

From the previous one data frame df.
Create this data frame:

Id x.a y.a x.b y.b x.c y.c
1 1 1.00000 -20.000000 4.818182 -18.181818 8.636364 -16.363636
4 2 12.45455 -14.545455 16.272727 -12.727273 20.090909 -10.909091
7 3 23.90909 -9.090909 27.727273 -7.272727 31.545455 -5.454545
10 4 35.36364 -3.636364 39.181818 -1.818182 43.000000 0.000000

Exercise 3

Create two data frame df1 and df2:

> df1
Id Age
1 1 14
2 2 12
3 3 15
4 4 10
> df2
Id Sex Code
1 1 F a
2 2 M b
3 3 M c
4 4 F d

From df1 and df2 create M:

>M
Id Age Sex Code
1 1 14 F a
2 2 12 M b
3 3 15 M c
4 4 10 F d

Exercise 4

Create a data frame df3:

> df3
id2 score
1 4 100
2 3 98
3 2 94
4 1 99

From M and df3 create N:

Id Age Sex Code score
1 1 14 F a 99
2 2 12 M b 94
3 3 15 M c 98
4 4 10 F d 100

Exercise 5

Consider the previous one data frame N:
1)Remove the variables Sex and Code
2)From N, create a data frame:


values ind

1 1 Id
2 2 Id
3 3 Id
4 4 Id
5 14 Age
6 12 Age
7 15 Age
8 10 Age
9 99 score
10 94 score
11 98 score
12 100 score

Exercise 6

For this exercise, we’ll use the (built-in) dataset trees.
a) Make sure the object is a data frame, if not change it to a data frame.
b) Create a new data frame A:

>A
Girth Height Volume
mean_tree 13.24839 76 30.17097
min_tree 8.30000 63 10.20000
max_tree 20.60000 87 77.00000
sum_tree 410.70000 2356 935.30000

Exercise 7

Consider the data frame A:
1)Order the entire data frame by the first column.
2)Rename the row names as follows: mean, min, max, tree

Exercise 8

Create an empty data frame with column types:


> df
Ints Logicals Doubles Characters
(or 0-length row.names)

Exercise 9

Create a data frame XY

X=c(1,2,3,1,4,5,2)
Y=c(0,3,2,0,5,9,3)
> XY
X Y
1 1 0
2 2 3
3 3 2
4 1 0
5 4 5
6 5 9
7 2 3

1) Look at duplicated elements using a provided R function.
2) Keep only the unique rows of XY using a provided R function.

Exercise 10
For this exercise, we’ll use the (built-in) dataset Titanic.
a) Make sure the object is a data frame, if not change it to a data frame.
b) Define a data frame containing the rows with value "1st" for the Class variable and value "No" for the Survived variable, keeping the variables Sex, Age and Freq.

Sex Age Freq
1 Male Child 0
5 Female Child 0
9 Male Adult 118
13 Female Adult 4

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Categories: Methodology Blogs

A heat map of Divvy bike riders in Chicago

Mon, 2016-11-28 12:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Chicago's a great city for a bike-sharing service. It's pretty flat, and there are lots of wide roads with cycle lanes. I love Divvy and use it all the time. Not so much in the winter though: it gets very cold here. Nonetheless, this heat map of Divvy riders, created in R by Austin Wehrwein, reveals a hardcore set of riders that use the service even in the bitter cold. For example, on January 27, 2014, when the temperature was around 0°F (-18°C), there were still 406 riders that day.

There's no calendar heatmap extension to ggplot2 (that I know of), but as Austin shows below it's actually very easy to create one in base ggplot2 simply by defining weekday, month and year columns in the data.

Austin Wehrwein: Heatmaps with Divvy Data
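Austin's post has the real code and data. Purely as a hedged sketch of the idea, with simulated ride counts standing in for the Divvy data, a calendar-style heatmap can be pieced together from geom_tile() plus derived weekday, week and month columns:

library(ggplot2)

# placeholder daily ride counts for one year (not the actual Divvy data)
set.seed(1)
dat <- data.frame(date = seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day"))
dat$rides <- rpois(nrow(dat), lambda = 2000)

# derive calendar columns: weekday, week of year, month
dat$weekday <- factor(as.POSIXlt(dat$date)$wday, levels = 0:6,
                      labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
dat$week  <- as.integer(format(dat$date, "%U"))
dat$month <- factor(as.integer(format(dat$date, "%m")), levels = 1:12, labels = month.abb)

ggplot(dat, aes(x = week, y = weekday, fill = rides)) +
  geom_tile(colour = "white") +
  facet_wrap(~ month, scales = "free_x") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(x = "Week of year", y = NULL, fill = "Riders")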

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

Simulation in R For AP Statistics

Mon, 2016-11-28 06:00

(This article was first published on R – Saturn Science, and kindly contributed to R-bloggers)

Make a simulation and function using R.
My Class at Work With R

I continue to teach some R programming in my AP Stats class because this is an essential skill for them to have.

We are now in chapter five of our textbook, Probability and Simulation. I had my students review the parking problem on page 290. The problem asks us to simulate picking two students from a total of 95 in the high school and to find the probability that both students come from one class of only 28 students.

The textbook solves the simulation problem by using the random number table in the back of the book. By reading across, you read two digits at a time and check to see if the two numbers are less than or equal to 28.

To use the random digits chart, you need to number the students from 01 to 95. Since there are 28 students in the class that won the lottery, we want to simulate the probability of picking two students from this class of 28 students from the total population of 95.

For this lesson, I wanted my class to explore the concept of simulation in more depth, and I thought R would be better for teaching it than our TI-Nspire calculators. The book’s example puts the simulated probability at about 10 percent.

To begin, I demonstrated the sample function to the class on the Smartboard with RStudio. I said, “Let’s take a sample 100 times and see what it looks like.” I displayed the basic outline of the sample function as follows:

sample (from this number : to this number, how many, replace = T or F)

sample(1:95,2, replace = FALSE)

## [1] 56 45

Here is the result of just using sample. If both numbers are less than 29, both students were picked from that one class. But what if you wanted to repeat this sampling? You would have to keep running this line of code, and it seems there should be a better way. The better way is the replicate function.

How to use Replicate
Class Lesson and Rmd Code on Smartboard

Here we will use replicate to take a sample 100 times and check to see how many times we get a pair of numbers from 1 to 28.

To use replicate, we specify two arguments: n tells R how many times we want to repeat something and expr is the R command we want to repeat.

For example:

replicate(n=100, expr = sample(1:95,2, replace = FALSE)) # take a sample 100 times.

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]   13   20   23   69   16   13   20   42   44    93    75    68    55
## [2,]   73   67   67   46   62   17   92   46   73    65     1    93    61
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    88    28    62    49    39    46    36     6    46    76    67
## [2,]    70    68    69    74    74    89    55    65    60    26    94
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,]    49    55    14    77    23    24    92    60    83    80    42
## [2,]    34    60    88    93    61     1    28    41    85    63    44
##      [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,]    30    37    47    61    61    89    80    87    28    12     3
## [2,]    38    19    58    74     1    66     2    67    22    73     1
##      [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## [1,]    42    79    64     3     5    37    85    42    63    37     1
## [2,]    47     2    60    14     8    19    14    28    74     6    65
##      [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68]
## [1,]    32    78    95    30    69    69    56    43    77     7    35
## [2,]     4    14    88    84    46    95    21    44    30    73     7
##      [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78] [,79]
## [1,]    42     8    30    37    55     3    75    33    16    88    86
## [2,]    58    53    10    22    69    45    22    24    22    55    54
##      [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90]
## [1,]    83    70    58    59    87    75    52    51    22    38    91
## [2,]    44    14    62    11    64    30    29    91    25     6    51
##      [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## [1,]    61    13     2     6    86    11    28     6     6      5
## [2,]    70    94    75    34    63    64    54    45    88     56

We can also shorten this a bit and get the same behaviour, because n and expr are matched by position:

replicate(100, sample(1:95,2, replace = FALSE))

Writing a function:

Another way to do this simulation is to write a short function and call that function inside replicate. Here is a simple function that takes a sample of 2 from the 95 students; replicate then runs it a specified number of times.

Use a function and replicate
R Markdown makes it easier to publish

Here I create a function called park that samples 2 numbers without replacement from 1 to 95. Then we call park inside replicate.

The keyword function must be followed by parentheses; it tells R that what comes next is a function. The curly braces, {}, mark the beginning and end of the function body, and everything between them is part of your function.

The return() statement ends your function; whatever you put between its parentheses is returned from inside the function to your workspace. Here I use the assignment operator to store the function in an object called park.

park <- function() {
  win <- sample(1:95, 2, replace = FALSE)
  return(win)
}

replicate(n = 100, park())

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    5   14   34   85   28   36   94   46   45    47    93    46    26
## [2,]   47   50   38   62   34   42    9   93   85    89    54    28    70
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,]    67    52    56    44    94    43    86    69    26    34    81
## [2,]    73     7    57    58     8     8    20    15     3    92    13
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## [1,]    38    43    59    19    71    81    67    22    45    27    70
## [2,]    72    73    65    59    32    34    93    42    17     3    89
##      [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## [1,]    70    21     2    18     8    89    89    12    15     7    18
## [2,]    59     7    34    45    74    18    91    83    67     2    77
##      [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## [1,]    69    20    69    18    85    92    83    44    13    74    19
## [2,]    35    43     9    68    70    88    95    14    52    21    46
##      [,58] [,59] [,60] [,61] [,62] [,63] [,64] [,65] [,66] [,67] [,68]
## [1,]    21     4    42    87    68    56    78     8    26    29    14
## [2,]    17    26    41    25    67     5    59     6    21    26    50
##      [,69] [,70] [,71] [,72] [,73] [,74] [,75] [,76] [,77] [,78] [,79]
## [1,]    92    88    48    59    10    21    10    67     5    25     9
## [2,]    69    30    81    22    72    95    40    73    95    34    16
##      [,80] [,81] [,82] [,83] [,84] [,85] [,86] [,87] [,88] [,89] [,90]
## [1,]    61    43    95    90    72    12    61    27    63    17    14
## [2,]    39    88    18    18    86    54    94    14    84    89    84
##      [,91] [,92] [,93] [,94] [,95] [,96] [,97] [,98] [,99] [,100]
## [1,]     3    94    46    13     9    12    88    43     5     78
## [2,]     7    73    52    20    61    28     5    34    65     21

Extension

One student asked if it was possible to use a for loop instead, and I said yes. I also said that carelessly written for loops can be slow on large data sets, and that for our purpose it is simpler to use replicate or to write a function. A rough for-loop version is sketched below.
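For the curious, here is one way a for-loop version might look; this is only a sketch, not code we used in class:

results <- matrix(NA, nrow = 2, ncol = 100)        # one column per simulated pair
for (i in 1:100) {
  results[, i] <- sample(1:95, 2, replace = FALSE) # fill column i with a new pair
}
results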

Their assignment is due in two days. They are to knit to Word and upload to our class Moodle site. They can also submit their Rmd file to Moodle.

For extra credit, I will have them try to figure out how to tally what percentage of the simulated pairs come from the winning class. I suspect we will use some type of loop or apply function; one possible approach is sketched below.
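One possible way to do the tally, included here only as a sketch rather than the answer key: have replicate return TRUE whenever both sampled numbers fall between 1 and 28, then take the mean of the logical results.

set.seed(1)                    # only so the run below is reproducible
same_class <- replicate(n = 10000,
                        expr = all(sample(1:95, 2, replace = FALSE) <= 28))
mean(same_class)               # proportion of simulated pairs from the winning class

With 10,000 repetitions this estimate should land close to the exact value computed next.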

We will learn the multiplication rule for probabilities in the next section. The exact probability for this problem is (28/95)(27/94) = 756/8930 ≈ 0.0847, or about 8.5%.

To leave a comment for the author, please follow the link and comment on their blog: R – Saturn Science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs