R bloggers

R news and tutorials contributed by 580 R bloggers

Mapping medical cannabis dispensaries: from PDF table to Google Map with R

Tue, 2016-08-16 11:44

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Sheri Gilley, Microsoft Senior Software Engineer

In 2014, Illinois passed into law the creation of a medical cannabis pilot program. As my son has cancer and marijuana could greatly help with some of his symptoms, we eagerly applied for a card when registration was available early in 2015. The first dispensaries were not available until November 2015. At that time there were 9 dispensaries; the PDF file with a table of dispensary names and locations provided by the Illinois Department of Health was an adequate way to find a dispensary.

In the time that dispensaries have been available, my son has been in various hospitals and facilities in and around the city of Chicago.  First we were in Park Ridge, then Hyde Park, then Hinsdale, then downtown Chicago and now finally back home in Oak Park.  As we moved around the city, I would use that same PDF file to locate the dispensary closest to me. The list has grown from 9 names and addresses to 40 today. With 40 entries, the PDF table format is not at all useful for showing where the dispensaries are located.  The entries are listed in the order of the license issue date, making it all the more difficult to see which dispensaries might be easiest for me to visit.

So one weekend I decided to create a map of all the current locations.  Keeping in mind that more dispensaries will be available in the future, I wanted to create code that would read the official list of registered dispensaries, so that updates would be easy as more entries were added.

I knew I could read the text of the file in R using pdftools, and could put the locations onto a Google map using googleVis. The hardest part of the code was trying to filter out the noise included in the text and reliably get the name, address, and phone number of each dispensary into a data frame. A few handy gsub statements worked their magic and I was left with data ready for mapping.
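
A minimal sketch of that first step (not the author's exact code; the file name and the regular expressions here are illustrative assumptions):

library(pdftools)
txt <- pdf_text("dispensaries.pdf")            # one character string per page
lines <- trimws(unlist(strsplit(txt, "\n")))   # split pages into individual lines
lines <- lines[lines != ""]                    # drop empty lines
# drop header/footer noise (illustrative patterns, not the real ones)
lines <- lines[!grepl("^(Illinois|Page [0-9]+)", lines)]
# collapse runs of whitespace so the name/address/phone fields are easier to split
lines <- gsub("[[:space:]]{2,}", "  ", lines)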

I added in some geocoding to get the longitude and latitude, thanks to this tip.
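
The post only links to the tip, so the exact geocoding code is not shown; one common approach at the time was ggmap::geocode(), sketched here under the assumption that the data frame all has an address column:

library(ggmap)
result <- geocode(all$address)                    # data frame with columns 'lon' and 'lat'
names(result)[names(result) == "lon"] <- "long"   # match the result$long used below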

Finally, after the data manipulation, the code to produce the map itself is rather straightforward:

# create id and LatLong for googleVis map
all$id <- paste(all$name, all$address, all$phone, sep='; ')
all$LatLong = paste(result$lat, result$long, sep=":")

# Now plot with googleVis
require(googleVis)

g1 <- gvisMap(all, "LatLong", "id",
options=list(showTip=TRUE,
showLine=TRUE,
enableScrollWheel=TRUE,
mapType='normal',
width=400, height=400
))

# this opens a browser with the plot
plot(g1)

# write the code that will be used on my website
cat(g1$html$chart, file="dispensariesIL.html")

I thought the map might be useful for others in Illinois also interested in finding a dispensary, so I created a small website for the map, and included a few other tips and tricks I have learned along the way as the procurer of medicinal cannabis in my household.

Here is the resulting map, generated on August 12, 2016 (click to see the interactive version):

You can create this map yourself with this R code.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

ggtree paper published

Tue, 2016-08-16 10:10

(This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

Today ggtree received 100 stars on GitHub, and I learned from a tweet that the paper had gone online the same day:

I am quite excited about it. Now ggtree is citable by doi:10.1111/2041-210X.12628.

In the review period, ggtree had already been cited by several papers:

1. https://doi.org/10.7717/peerj.976

Annotating phylogenetic tree with image files (i.e. secondary structure).

2. http://dx.doi.org/10.1016/j.meegid.2015.12.006

Highlight clades and annotate phylogenetic tree with genotype table.

3. http://dx.doi.org/10.3389%2Ffcimb.2016.00036

Annotate phylogenetic tree with binary matrix.

4. http://dx.doi.org/10.1104/pp.16.00359

Annotate phylogenetic tree with sequence features.

5. http://dx.doi.org/10.1101/067207

Annotate with bootstrap values.

6. http://www.nature.com/articles/srep30158

Visualize bootstrap values inferred by RAxML and PhyML on the same tree.

7. http://dx.doi.org/10.1101/053371

8. http://dx.doi.org/10.1105/tpc.16.00053

9. http://dx.doi.org/10.1111/mec.13474

10. http://dx.doi.org/10.1371/journal.pone.0160958

11. http://dx.doi.org/10.1111/2041-210X.12593

12. http://dx.doi.org/10.1101/053371

The ggtree paper provides several comprehensive and reproducible examples. It can be a good reference for complex tree annotation. My slides from the 9th China R conference summarize the functionalities that ggtree provides.
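
For readers who have never used the package, here is a minimal sketch of basic usage (the random tree is purely for illustration, and it assumes the ape and ggtree packages are installed):

library(ape)      # rtree()
library(ggtree)
set.seed(42)
tree <- rtree(10)                 # a random 10-tip tree, for illustration only
ggtree(tree) + geom_tiplab() +    # draw the tree and label the tips
  theme_tree2()                   # add a branch-length axis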

More information can be found on the ggtree homepage. Hope you have fun with ggtree.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

14 new R jobs from around the world (2016-08-16)

Tue, 2016-08-16 04:19

Here are the new R Jobs for 2016-08-16.

To post your R job on the next post

Just visit this link and post a new R job to the R community. You can either post a job for free (which works great), or pay $50 to have your job featured (and get extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

New Featured Jobs
More New Jobs
  1. Freelance
    Instructor for teaching a 5-day Course on data manipulation and visualization in R @ Berlin
    Physalia-courses – Posted by Physalia-courses
    Berlin
    Berlin, Germany
    13 Aug 2016
  2. Full-Time
    R Package Developer for The Center for Tropical Forest Science @ Washington, U.S.
    Center for Tropical Forest Science – Forest Global Earth Observatory (CTFS-ForestGEO) – Posted by krizell
    Washington
    District of Columbia, United States
    12 Aug 2016
  3. Full-Time
    Assistant/Associate Professor in Innovative Qualitative Methodologies – 003539-2016 Cluster Hire
    Purdue University – Posted by Purdueclusterhiresearch
    West Lafayette
    Indiana, United States
    11 Aug 2016
  4. Full-Time
    Assistant/Associate Professor in Innovative Quantitative Methodologies – 003538-2016 Cluster Hire
    Purdue University – Posted by Purdueclusterhiresearch
    West Lafayette
    Indiana, United States
    11 Aug 2016
  5. Full-Time
    Senior Finance Manager at Amazon @ Seattle, Washington, United States
    Amazon – Posted by Ryan
    Seattle
    Washington, United States
    9 Aug 2016
  6. Part-Time
    Shiny developer
    ironAPI – Posted by Babak
    Wien
    Wien, Austria
    8 Aug 2016
  7. Full-Time
    Data Science Faculty (Assistant/Associate Professor, Tenure Track)
    Saint Louis University – Posted by armbrees
    St. Louis
    Missouri, United States
    7 Aug 2016
  8. Full-Time
    Data Scientists for IDATA SAS @ Medellín, Antioquia, Colombia
    IDATA SAS – Posted by vmhoyos
    Medellín
    Antioquia, Colombia
    5 Aug 2016
  9. Full-Time
    R Software Developer for University of Maryland Libraries
    University of Maryland Libraries – Posted by dywong
    College Park
    Maryland, United States
    4 Aug 2016
  10. Full-Time
    Data Visualiser
    CensorNet – Posted by Stephanie Locke
    Emersons Green
    England, United Kingdom
    4 Aug 2016
  11. Full-Time
    Data Scientist/Modeling – Applied Technology
    Crowe Horwath – Posted by Neil.Schneider
    Indianapolis
    Indiana, United States
    3 Aug 2016
  12. Full-Time
    Sr. Analyst – Client Insights/Design
    Maritz Motivation Solutions – Posted by maritzrecruiter
    Fenton
    Missouri, United States
    2 Aug 2016
  13. Full-Time
    Data Scientist / Bioinformatician / Statistician at University of California
    UCSF – Posted by clif.duhn
    San Francisco
    California, United States
    2 Aug 2016
  14. Freelance
    Big Data in Digital Health (5-10 hours per week)
    MedStar Institute for Innovation – Posted by Praxiteles
    Anywhere
    1 Aug 2016

On R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 200 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


The Win-Vector parallel computing in R series

Tue, 2016-08-16 03:57

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.

For your convenience here they are in order:

  1. A gentle introduction to parallel computing in R
  2. Running R jobs quickly on many machines
  3. Can you nest parallel operations in R?

Please check it out, and please do Tweet/share these tutorials.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

RQGIS 0.1.0 release

Tue, 2016-08-16 01:50

(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

We proudly announce the release of RQGIS! RQGIS establishes an interface between R and QGIS, i.e. it allows the user to access the more than 1000 QGIS geoalgorithms from within R. To install it, run:

install.packages("RQGIS")

Naturally, you need to install other software (such as QGIS) to run RQGIS properly. Therefore, we wrote a package vignette, which guides you through the installation process of QGIS, GRASS and SAGA on various platforms. To access it, run:

vignette("install_guide", package = "RQGIS")

How to use RQGIS

To introduce RQGIS, we will demonstrate how to calculate the SAGA topographic wetness index. The first step is the creation of a QGIS environment. This is a list containing the paths RQGIS needs to access QGIS. Luckily for us, set_env does this for us. We merely need to specify the root path to the QGIS installation. This is most likely something like C:/OSGeo4W~1 on Windows, /usr on Linux and /usr/local/Cellar on a Mac. If we do not specify a path to the QGIS root folder, set_env tries to figure it out. This, however, might be time-consuming depending on the size of the drives to search.

library("RQGIS")
env <- set_env()

Secondly, we need to find out the name of the function in QGIS which calculates the wetness index:

find_algorithms("wetness index", name_only = TRUE, qgis_env = env)
## [1] "taudem:topographicwetnessindex"  "saga:sagawetnessindex"
## [3] "saga:topographicwetnessindextwi" ""

There are three algorithms containing the words wetness and index in their short description. Here, we choose saga:sagawetnessindex. To retrieve the corresponding function arguments, we use get_args_man. By setting options to TRUE, we indicate that we would like to use the default values, if available:

args <- get_args_man(alg = "saga:sagawetnessindex", options = TRUE, qgis_env = env)
# print the retrieved list
args
## $DEM
## [1] "None"
##
## $SUCTION
## [1] "10.0"
##
## $AREA_TYPE
## [1] "0"
##
## $SLOPE_TYPE
## [1] "0"
##
## $SLOPE_MIN
## [1] "0.0"
##
## $SLOPE_OFF
## [1] "0.1"
##
## $SLOPE_WEIGHT
## [1] "1.0"
##
## $AREA
## [1] "None"
##
## $SLOPE
## [1] "None"
##
## $AREA_MOD
## [1] "None"
##
## $TWI
## [1] "None"

Of course, we need to specify certain function arguments such as the input (DEM) and output (TWI) arguments. Please note that RQGIS accepts as input argument either the path to a spatial object or a spatial object residing in R. Here, we will use a digital elevation model, which comes with the RQGIS package:

# load data into R
data("dem", package = "RQGIS")
# define input
args$DEM <- dem
# specify output path
args$TWI <- "twi.asc"

Finally, we can access QGIS from within R by supplying run_qgis with the specified arguments as a list. Setting the load_output argument also loads the QGIS output directly into R.

twi <- run_qgis(alg = "saga:sagawetnessindex",
                params = args,
                load_output = args$TWI,
                qgis_env = env)
# visualize the result
library("raster")
hs <- hillShade(terrain(dem), terrain(dem, "aspect"), 40, 270)
pal <- RColorBrewer::brewer.pal(6, "Blues")
spplot(twi, col.regions = pal, alpha.regions = 0.8,
       scales = list(tck = c(1, 0)),
       colorkey = list(space = "bottom", width = 1, height = 0.5,
                       axis.line = list(col = "black")),
       cuts = length(pal) - 1) +
  latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)), under = TRUE)

For more information on RQGIS, please refer to https://github.com/jannes-m/RQGIS.

To leave a comment for the author, please follow the link and comment on their blog: R – jannesm.

15 Page Tutorial for R

Mon, 2016-08-15 21:05

(This article was first published on R – StudyTrails, and kindly contributed to R-bloggers)

For Beginners in R, here is a 15 page example based tutorial that covers the basics of R.

  1. Starting R – Trivial tutorial on how to start R for those just wondering what to do next after downloading R.
  2. Assignment Operator – Two important assignment operators in R are <- and =
  3. Listing Objects – All entities in R are called objects. They can be arrays, numbers, strings, functions. This tutorial will cover topics such as listing all objects, listing object from a specific environment and listing objects that satisfy a particular pattern.
  4. Sourcing R File – R code can also be written in a file, and the file can then be sourced from other R code.
  5. Basic Datastructures in R – Understanding data structures is probably the most important part of learning R. This tutorial covers vector and list. It also covers subsetting.
  6. Data Structures in R, Matrix and Array – Covers matrices and arrays. An array is a vector with additional attributes: dim, which stores the dimension of the array, and dimnames, which stores the names of the dimensions. A matrix is a 2-dimensional array. Head to the tutorial for examples of both.
  7. Data Structures in R, factors and Data Frame – DataFrames are probably the most widely used data structure. It would help to just go through the examples and practice them. The tutorial covers important operations on the data frame and factors as well as subsetting data frames.
  8. Data Structures in R, Data Frame Operations – Covers some more operations on the data frame; including stack, attach, with, within, transform, subset, reshape and merge
  9. Control Structures in R – The basics of any programming language. Control loops allow looping through data structures. The tutorial covers if, if-else, for, while, next, break, repeat and switch
  10. Control Structures in R – apply – To make looping more efficient R has introduced a family of ‘apply’ functions. For example, the apply function can be used to apply a function over specific elements of an array (or matrix). The tutorial covers lapply, sapply, apply, tapply.
  11. Control Structures in R – apply 2 – We continue with some more apply functions – mapply and by.
  12. Functions in R – The nuts and bolts of any programming language. This tutorial not only explains the concept of functions using examples but also covers various scenarios such as anonymous functions or passing functions around.
  13. Printing on Console in R – Printing on the console can come in very handy. The tutorial covers the print and cat functions as well as printing data frames.
  14. Pretty printing using Format function in R – This tutorial looks at how to use the formatting functions for pretty printing.
  15. Reshape and Reshape2 Package – Once you start working on real life problems in R, a lot of time would be spent on manipulating data. Reshape and Reshape2 package will prove very powerful in converting data to the format required by other libraries. This tutorial has detailed examples to explain the package.

These tutorials are designed for beginners in R, but they can also be used by experienced programmers as a refresher course or as a reference. Running loops in R can be slow, and therefore the apply group of functions as well as the reshape package can drastically improve the performance of the code.
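
As a small illustration of that point, here is the same column-mean computation written first as an explicit loop and then with sapply() (a toy example, not taken from the tutorials):

df <- data.frame(a = 1:5, b = 6:10, c = 11:15)
# explicit loop
means <- numeric(ncol(df))
for (i in seq_along(df)) {
  means[i] <- mean(df[[i]])
}
means
# the same values in one line with sapply()
sapply(df, mean)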

We hope you enjoy the tutorials.

To leave a comment for the author, please follow the link and comment on their blog: R – StudyTrails.

tidyr 0.6.0

Mon, 2016-08-15 15:53

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m pleased to announce tidyr 0.6.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:

install.packages("tidyr")

I mostly released this version to bundle up a number of small tweaks needed for R for Data Science. But there’s one nice new feature, contributed by Jan Schulz: drop_na(). drop_na() drops rows containing missing values:

df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
#> 3    NA     b

# Called without arguments, it drops rows containing
# missing values in any variable:
df %>% drop_na()
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a

# Or you can restrict the variables it looks at,
# using select() style syntax:
df %>% drop_na(x)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>

Please see the release notes for a complete list of changes.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Probably the most useful R function I’ve ever written

Mon, 2016-08-15 15:19

(This article was first published on R language – Burns Statistics, and kindly contributed to R-bloggers)

The function in question is scriptSearch. I’m not much for superlatives — “most” and “best” imply one dimension, but we live in a multi-dimensional world. I’m making an exception.

The statistic I have in mind for this use of “useful” is the waiting time between calls to the function divided by the human time saved by the call.

I wrote a version of this for a company where I do consulting. There are few days working there that I don’t have at least one bout with it.  Using scriptSearch can easily save half an hour compared to what I would have done prior to having the function.

scriptSearch

The two main inputs are:

  • a string to search for
  • a directory to search in

By default it only looks in R scripts in the directory (and its subdirectories).

Examples of directories to search are:

  • directory holding a large collection of R scripts
  • directory holding the source for local R packages
  • personal directory with lots of subdirectories containing R scripts and functions

Examples of uses are:

  • where is blimblam defined?
  • where are all the uses of splishsplash in the local packages (because I want to change its arguments)?
  • a few weeks ago I created a pdf called factor_history, where is the code that produced that?

These uses might be done with something like:

  • scriptSearch("blimblam *<-", "path/to/scriptFarm", sub=FALSE)
  • scriptSearch("splishsplash", "path/to/Rsource")
  • scriptSearch("factor_history", "..")

You may be confused by the asterisk in the first call.  The string to search for can be a regular expression.  In this case the asterisk means that it will find assignments whether or not there is a space between the object name and the assignment arrow.
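
To make the idea concrete, here is a rough base-R approximation of such a search (this is NOT the BurStMisc implementation; the function name and arguments below are made up for illustration):

scriptSearchSketch <- function(pattern, dir = ".", recursive = TRUE) {
  files <- list.files(dir, pattern = "\\.[rR]$", full.names = TRUE,
                      recursive = recursive)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    found <- grep(pattern, lines)
    if (length(found)) data.frame(file = f, line = found, text = lines[found],
                                  stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)   # one row per matching line; files with no match drop out
}
# e.g. scriptSearchSketch("blimblam *<-", "path/to/scriptFarm")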

BurStMisc

scriptSearch was the main motivation for updating the BurStMisc package to version 1.1.  The package is on CRAN.

ntile

The ntile function is also new to BurStMisc.  It returns equally-sized ordered groups from a numeric vector — for example, quintiles or deciles.

A more primitive version of the function appeared in a blog post called “Miles of iles”.  There is some discussion there of alternative functions.
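
One common way to get such equally sized ordered groups is via ranks; a one-line sketch (not necessarily how BurStMisc implements ntile):

ntileSketch <- function(x, n) ceiling(n * rank(x, ties.method = "first") / length(x))
ntileSketch(c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10), 5)
# [1] 3 1 5 2 4 1 4 2 3 5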

writeExpectTest

While I was preparing the update to BurStMisc, I found that automating the writing of some tests using the testthat package was both warranted and feasible.  The writeExpectTest function is the result.

corner

The generally useful function that was already in BurStMisc is corner.  This allows you to see a few rows and columns of a large matrix or data frame — or higher dimensional array.

Epilogue

I want to spread the news
That if it feels this good getting used
You just keep on using me

— from “Use Me”  by Bill Withers

The post Probably the most useful R function I’ve ever written appeared first on Burns Statistics.

To leave a comment for the author, please follow the link and comment on their blog: R language – Burns Statistics.

The inexorable growth of student debt, charted with R

Mon, 2016-08-15 15:15

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Len Kiefer, Deputy Chief Economist at Freddie Mac, recently published the following chart to his personal blog showing household debt in the United States (excluding mortgage debt). As you can see, student loan debt has steadily increased over the last 13 years and has now eclipsed all other forms of non-mortgage debt:

He also created this animated chart showing the growth (and occasional decline) of all forms of debt (including mortgages). All categories are scaled to the same nominal value in March 2003, and since that time student debt in the US has more than quintupled.

Both charts were created using the R language (the code is included in the blog post linked below). The data come from the New York Federal Reserve, which Len reads into R using fread from the data.table package. The line charts were created using the ggplot2 package, with the ggrepel extension to keep the labels from overlapping. The animated version was created using the saveGIF function of the animation package.
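
A stripped-down sketch of that pipeline (the file name and column names here are hypothetical; Len's post contains the real code and data):

library(data.table)   # fread()
library(ggplot2)
library(ggrepel)      # geom_text_repel() keeps the labels from overlapping
debt <- fread("household_debt.csv")   # assumed columns: date, type, balance
ggplot(debt, aes(x = as.Date(date), y = balance, colour = type)) +
  geom_line() +
  geom_text_repel(data = debt[date == max(date)], aes(label = type),
                  show.legend = FALSE) +
  labs(x = NULL, y = "Balance", colour = NULL)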

For more charts (including some interesting by-state charts) and the complete details on the implementation, follow the link to Len's blog below.

Len Kiefer: Consumer Credit Trends (via Sharon Machlis)

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Can you nest parallel operations in R?

Mon, 2016-08-15 14:29

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.

One question that comes up over and over again is “can you nest parLapply?”

The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (simulate a “yes”).

I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but that is a different issue from whether one can invent a work-around. For example: a “yes” answer (really meaning there are work-arounds) can be found here. Again this is a different question from “is there a way to nest foreach loops” (which is possible through the nesting operator %:%, which presumably handles working around nesting issues in parLapply).

Let’s set up a concrete example, so we can ask and answer a precise question. Suppose we have a list of jobs (coming from an external source) that we will simulate with the code fragment below.

jobs <- list(1:3,5:10,100:200)

Notice the jobs have wildly diverging sizes, this is an important consideration.

Suppose the task we want to perform is to sum the square roots of the entries. The standard (non-parallel) calculation would look like the following.

worker1 <- function(x) {
  sum(sqrt(x))
}
lapply(jobs,worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

For didactic purposes please pretend that the sum function is very expensive and the sqrt function is somewhat expensive.

If it was obvious we always had a great number of small sub-lists we would want to use parallelization to make sure we are performing many sums at the same time. We would then parallelize over the first level as below.

clus <- parallel::makeCluster(4)
parallel::parLapplyLB(clus,jobs,worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

Notice that parallel::parLapplyLB uses almost the same calling convention as lapply and returns the exact same answer.

If it was obvious we had a single large sub-list we would want to make sure we were always parallelizing the sqrt operations so we would prefer to parallelize as follows:

mkWorker2 <- function(clus) {
  force(clus)
  function(x) {
    xs <- parallel::parLapplyLB(clus,x,sqrt)
    sum(as.numeric(xs))
  }
}
worker2 <- mkWorker2(clus)
lapply(jobs,worker2)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

(For the details of building functions and passing values to remote workers please see here.)

If we were not sure if in the future what structure we would encounter we would prefer to schedule all operations for possible parallel execution. This would minimize the number of idle resources and minimize the time to finish the jobs. Ideally that would look like the following (a nested use of parallel):

parallel::parLapplyLB(clus,jobs,worker2)
## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: invalid connection

Notice the above fails with an error. Wishing for flexible code is what beginners intuitively mean when they ask if you can nest parallel calls. They may not be able to explain it, but they are worried they don’t have a good characterization of the work they are trying to parallelize over. They are not asking if things get magically faster by “parallelizing parallel.”

It isn’t too hard to find out what the nature of the error is: the communication connection socket file descriptors (con) are passed as integers to each machine, but they are not valid descriptors where they arrive (they are just integers). We can see this by looking at the structure of the cluster:

str(clus)
## List of 4
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 5
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 1
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 6
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 2
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 7
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 3
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 8
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 4
##   ..- attr(*, "class")= chr "SOCKnode"
##  - attr(*, "class")= chr [1:2] "SOCKcluster" "cluster"

mkWorker3 <- function(clus) {
  force(clus)
  function(x) {
    as.character(clus)
  }
}
worker3 <- mkWorker3(clus)
parallel::parLapplyLB(clus,jobs,worker3)
## [[1]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[2]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[3]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"

What we are getting wrong is: we can’t share control of the cluster to each worker just by passing the cluster object around. This would require some central registry and call-back scheme (which is one of the things packages like foreach and doParallel accomplish when they “register a parallel back-end to use”). Base parallel depends more on explicit reference to the cluster data structure, so it isn’t “idiomatic parLapply” to assume we can find “the parallel cluster” (there could in fact be more than one at the same time).

So what is the work around?

One work around is to move to sophisticated wrappers (like doParallel or even future, also see here).

Another fix is (at the cost of time, effort and space) to re-organize the calculation into two sequenced phases, each of which is parallel, but not nested. It is a bit involved, but we show how to do that below (using R’s Reduce and split functions to reorganize the data, though one could also use so-called “tidyverse” methods).

# Preparation 1: collect all items into one flat list
sqrtjobs <- as.list(Reduce(c,jobs))
# Phase 1: sqrt every item in parallel
sqrts <- parallel::parLapplyLB(clus,sqrtjobs,sqrt)
# Preparation 2: re-assemble new job list that needs only sums
lengths <- vapply(jobs,length,numeric(1))
pattern <- lapply(seq_len(length(lengths)),
                  function(i) {rep(i,lengths[[i]])})
pattern <- Reduce(c,pattern)
sumjobs <- split(sqrts,Reduce(c,pattern))
sumjobs <- lapply(sumjobs,as.numeric)
names(sumjobs) <- names(jobs)
# Phase 2: sum all items in parallel
parallel::parLapplyLB(clus,sumjobs,sum)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

In conclusion: you can’t directly nest parLapply, but you can usefully sequence through it.

parallel::stopCluster(clus)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Dates and Times – Simple and Easy with lubridate exercises (part 1)

Mon, 2016-08-15 13:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


As in any programming language, handling date and time variables can be quite frustrating, since, for example, there is no one single format for dates, there are different time zones and there are issues such as daylight saving time.

Base R provides several packages for handling date and time variables, but they require mastering cumbersome syntax.

In order to solve all those issues and more, R package “lubridate” was created. This package on one hand has a very easy and intuitive syntax and on the other hand has functions that cover a wide range of issues related to dates and times.

In this first part in the series of lubridate exercises, the basic functionality of the package is covered.

As always, let’s start by downloading and installing the package:

install.packages("lubridate")
library(lubridate)

Answers to the exercises are available here.
If you have different solutions, feel free to post them.

Parsing Date-times


The ymd() series of functions are used to parse character strings into dates.
The letters y, m, and d correspond to the year, month, and day elements of a date-time.
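
For example (a quick illustration of the parser family, not the exercise answers):

ymd("2016-08-15")   # parses year-month-day:  "2016-08-15"
mdy("08/15/2016")   # parses month-day-year:  "2016-08-15"
dmy("15.08.2016")   # parses day-month-year:  "2016-08-15"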

Exercise 1
Populate a variable called “start_date” with a date representation of string “23012017”

Exercise 2
Use the lubridate function today to print the current date

Exercise 3
Extract the year part from the “start_date” variable created on exercise 1

Exercise 4
Extract the month part from the “start_date” variable created on exercise 1

Exercise 5
Extract the day part from the “start_date” variable created on exercise 1

Exercise 6
Set the month in variable “start_date” to February

Exercise 7
Add 6 days to variable “start_date”.
Did you notice what happened to the month value?

Exercise 8
Subtract 3 months from variable “start_date”

Exercise 9 (Advanced)
Populate a field called concatenated_dates with a vector of dates containing the following values:
“31.12.2015”, “01.01.2016”, “15.02.2016”

Exercise 10 (Advanced)
Calculate, in a short and simple way, the result of adding 1 through 10 days to the “start_date” variable

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

colourpicker: A colour picker widget for Shiny apps, RStudio, R-markdown, and ‘htmlwidgets’

Mon, 2016-08-15 13:00

(This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers)

Have you ever wanted to allow your users to select colours in your Shiny apps? Have you ever wanted to select a few colours to use in your R code, but found it tedious to search for the right colours? If you answered yes to any of those questions, or if you’re just curious, then colourpicker is the package for you!

The new colourpicker package gives you a colour picker widget that can be used in different contexts in R. Most of the functionality has existed in the shinyjs package for the past year and this package is simply a way to graduate all the colour picker functions into their own package.

Demos
  • Click here to view a live interactive demo of the colour picker input available for Shiny apps.
  • Click here to see the colour picker addin that lets you select colours interactively.

Installation

colourpicker is available through both CRAN and GitHub:

To install the stable CRAN version:

install.packages("colourpicker")

To install the latest development version from GitHub:

install.packages("devtools")
devtools::install_github("daattali/colourpicker")

Overview of main functions

colourpicker exposes three functions: colourInput() (for Shiny apps), colourPicker() (RStudio addin), and colourWidget() (an htmlwidget).

In Shiny apps (or R markdown): colourInput()

You can use colourInput() to include a colour picker input in Shiny apps (or in R markdown documents). It works just like any other native Shiny input, here is an example:

library(shiny)
shinyApp(
  ui = fluidPage(
    colourInput("col", "Select colour", "purple"),
    plotOutput("plot")
  ),
  server = function(input, output) {
    output$plot <- renderPlot({
      set.seed(1)
      plot(rnorm(50), bg = input$col, col = input$col, pch = 21)
    })
  }
)

To select colours to use in your R code: colourPicker()

colourpicker also provides an RStudio addin that can be used to easily select colours and save them as a variable in R. This can be useful if, for example, you want to pick some colours for a plot and you want an easy way to visualize and select a few colours. Here is a screenshot of the colour picker addin (you can either access this tool using the Addins menu or with colourPicker()). You can also watch a short GIF of it an action.

As an ‘htmlwidgets’ widget

The colour picker input is also available as an ‘htmlwidgets’ widget using the colourWidget() function. This may not be terribly useful right now since you can use the more powerful colourInput in Shiny apps and Rmarkdown documents, but it may come in handy if you need a widget.

Features of colourInput()

Simple and familiar

Using colourInput is extremely trivial if you’ve used Shiny, and it’s as easy to use as any other input control. It was implemented to very closely mimic all other Shiny inputs so that using it will feel very familiar. You can add a simple colour input to your Shiny app with colourInput("col", "Select colour", value = "red"). The return value from a colourInput is an uppercase HEX colour, so in the previous example the value of input$col would be #FF0000 (#FF0000 is the HEX value of the colour red). The default value at initialization is white (#FFFFFF).

Allowing “transparent”

Since most functions in R that accept colours can also accept the value “transparent”, colourInput has an option to allow selecting the “transparent” colour. By default, only real colours can be selected, so you need to use the allowTransparent = TRUE parameter. When this feature is turned on, a checkbox appears inside the input box.

If the user checks the checkbox for “transparent”, then the colour input is grayed out and the returned value of the input is transparent. This is the only case when the value returned from a colourInput is not a HEX value. When the checkbox is unchecked, the value of the input will be the last selected colour prior to selecting “transparent”.

By default, the text of the checkbox reads “Transparent”, but you can change that with the transparentText parameter. For example, it might be more clear to a user to use the word “None” instead of “Transparent”. Note that even if you change the checkbox text, the return value will still be transparent since that’s the actual colour name in R.

This is what a colour input with transparency enabled looks like

How the chosen colour is shown inside the input

By default, the colour input’s background will match the selected colour and the text inside the input field will be the colour’s HEX value. If that’s too much for you, you can customize the input with the showColour parameter to either only show the text or only show the background colour.

Here is what a colour input with each of the possible values for showColour looks like

Limited colour selection

If you want to only allow the user to select a colour from a specific list of colours, rather than any possible HEX colour, you can use the palette = "limited" parameter. By default, the limited palette will contain 40 common colours, but you can supply your own list of colours using the allowedCols parameter. Here is an image of the default limited colour palette.
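
A short sketch of what that might look like (the colour list here is arbitrary):

colourInput("col", "Select colour", value = "red",
  palette = "limited",
  allowedCols = c("red", "orange", "yellow", "green", "blue", "purple"))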

Updating a colourInput

As with all other Shiny inputs, colourInput can be updated with the updateColourInput function. Any parameter that can be used in colourInput can be used in updateColourInput. This means that you can start with a basic colour input such as colourInput("col", "Select colour") and completely redesign it with

updateColourInput(session, "col", label = "COLOUR:", value = "orange",
  showColour = "background", allowTransparent = TRUE,
  transparentText = "None")

Flexible colour specification

Specifying a colour to the colour input is made very flexible to allow for easier use. When giving a colour as the value parameter of either colourInput or updateColourInput, there are a few ways to specify a colour:

  • Using a name of an R colour, such as red, gold, blue3, or any other name that R supports (for a full list of R colours, type colours())
  • If transparency is allowed in the colourInput, the value transparent (lowercase) can be used. This will update the UI to check the checkbox.
  • Using a 6-character HEX value, either with or without the leading #. For example, initializing a colourInput with any of the following values will all result in the colour red: ff0000, FF0000, #ff0000.
  • Using a 3-character HEX value, either with or without the leading #. These values will be converted to full HEX values by automatically doubling every character. For example, all the following values would result in the same colour: 1ac, #1Ac, 11aacc.
Works on any device

If you’re worried that maybe someone viewing your Shiny app on a phone won’t be able to use this input properly – don’t you worry. I haven’t quite checked every single device out there, but I did spend extra time making sure the colour selection JavaScript works in most devices I could think of. colourInput will work fine in Shiny apps that are viewed on Android cell phones, iPhones, iPads, and even Internet Explorer 8+.

As usual, if you have any comments, feel free to contact me.

To leave a comment for the author, please follow the link and comment on their blog: Dean Attali's R Blog.

rfoaas 1.0.0

Sun, 2016-08-14 14:16

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The big 1.0.0 is here! Following the unsurpassed lead of the FOAAS project, we have arrived at a milestone: Release 1.0.0 is now on CRAN.

The rfoaas package provides an interface for R to the most excellent FOAAS service–which itself provides a modern, scalable and RESTful web service for the frequent need to tell someone to f$#@ off.

Release 1.0.0 brings fourteen (!!) new access points: back(), bm(), gfy(), think(), keep(), single_(), look(), looking(), no(), give(), zero(), pulp(), sake(), and anyway(). All with documentation and testing.

Even more neatly, thanks to a pull request by Tyler Hunt, we can now issue random FOs via the getRandomFO() function!

As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments etc should go to the GitHub issue tracker. More background information is on the project page as well as on the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Chi-Squared Test

Sun, 2016-08-14 14:09

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

Before we build stats/machine learning models, it is a good practice to understand which predictors are significant and have an impact on the response variable.

In this post we deal with a particular case when both your response and predictor are categorical variables.

By the end of this post you will have gained an understanding of what predictive modelling is and what the significance and purpose of the chi-squared statistic are. We will go through a hypothetical case study to understand the math behind it. We will actually implement a chi-squared test in R and learn to interpret the results. Finally you’ll be solving a mini challenge before we discuss the answers.

  • Background Knowledge
  • Case Study – Effectiveness of a drug treatment
  • Purpose and math behind Chi-Sq statistic
  • Chi-Sq Test
  • R Code
  • Mini-Challenge
Background knowledge – Predictive modelling

For the sake of completeness, let’s begin by understanding how predictive modelling works so you can better appreciate the significance of chi-squared tests and how it fits in the process.

Predictive modelling is a technique where we use statistical modelling or machine learning algorithms to predict a response variable/s based on one or more predictors.

The predictors are typically features that influence the response in some way. The models work best if the features are meaningful and have a significant relationship with the response.

But, you normally wouldn’t know beforehand if the response is dependent on a given feature or not. We can use the chi-squared test to determine if they are dependent or not, provided, both response and predictors are categorical variables.

Hypothetical Example: Effectiveness of a Drug Treatment

Let’s consider a hypothetical case where we test the effectiveness of a drug for a certain medical condition.

Suppose we have 105 patients under study and 50 of them were treated with the drug. The remaining 55 patients were kept as control samples. The health condition of all patients was checked after a week.

The following table shows if their condition improved or not. Just by looking at it, can you tell if the drug had a positive effect on the patients?

As you can see, 35 out of the 50 patients showed improvement. If the drug had no effect, the 50 treated patients would have split in the same proportion as the patients who were not given the treatment. But in this case, about 70% of the treated patients showed improvement, which is significantly higher than the control case.

Since both categorical variables have only 2 levels, it was sort of intuitive to say that the drug treatment and health condition are dependent. But, as the number of categories increases, we need a quantifiable statistic to definitively say if they are dependent or not.

One such a metric is the chi-squared statistic.

Chi-Squared Statistic

For the sake of understanding, let’s see how the chi-squared statistic is computed for the 2 by 2 case.
To begin with, we will assume that the 2 variables are not related to each other. If that is the case, can you tell what the expected value of each cell would be? For example, the first cell (treated and improved) will take the value: 50 times 61 divided by 105, which equals 29.05.

All the expected values can be computed this way (shown in brackets).

Once that is done, the Chi-Sq statistic is computed as follows.
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

Numeric Computation
$$\chi^2 = \frac{(35-29.05)^2}{29.05} + \frac{(15-20.95)^2}{20.95} + \frac{(26-31.95)^2}{31.95} + \frac{(29-23.05)^2}{23.05} = 5.56$$

This value will be larger if the difference between the actual and expected values widens.

Also, the more categories the variables have, the larger the chi-squared statistic will tend to be.
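
The same arithmetic can be reproduced directly in R from the counts used above:

observed <- matrix(c(35, 15,
                     26, 29), nrow = 2, byrow = TRUE,
                   dimnames = list(c("treated", "not-treated"),
                                   c("improved", "not-improved")))
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected                                  # the expected counts under independence
sum((observed - expected)^2 / expected)   # 5.5569, matching chisq.test() below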

Chi-Squared Test

In order to establish that 2 categorical variables are dependent, the chi-squared statistic should be above a certain cutoff. This cutoff increases as the number of classes within the variable increases.

Alternatively, you can just perform a chi-squared test and check the p-values.

Like all statistical tests, chi-squared test assumes a null hypothesis and an alternate hypothesis. The general practice is, if the p-value that comes out in the result is less than a pre-determined significance level, which is 0.05 usually, then we reject the null hypothesis.

H0: The two variables are independent.
H1: The two variables are related.

The null hypothesis of the chi-squared test is that the two variables are independent and the alternate hypothesis is that they are related.

R Code

Let’s work it out in R by doing a chi-squared test on the treatment (X) and improvement (Y) columns in treatment.csv
First, read in the treatment.csv data.

df <- read.csv("https://goo.gl/j6lRXD")
table(df$treatment, df$improvement)
              improved not-improved
  not-treated       26           29
  treated           35           15

Let’s do the chi-squared test using the chisq.test() function. It takes the two vectors as the input. We also set `correct=FALSE` to turn off Yates’ continuity correction.

# Chi-sq test
chisq.test(df$treatment, df$improvement, correct=FALSE)

	Pearson's Chi-squared test

data:  df$treatment and df$improvement
X-squared = 5.5569, df = 1, p-value = 0.01841

We have a chi-squared value of 5.55. Since we get a p-Value less than the significance level of 0.05, we reject the null hypothesis and conclude that the two variables are in fact dependent. Sweet!

Mini-Challenge

For this challenge, find out if the ‘cyl’ and ‘carb’ variables in ‘mtcars’ dataset are dependent or not. Go ahead, pause and try it out. I will be showing the answers in a few seconds. Good Luck!

So here is my answer:
Let’s have a look the table of mtcars$carb vs mtcars$cyl.

table(mtcars$carb, mtcars$cyl)
     4  6  8
  1  5  2  0
  2  6  0  4
  3  0  0  3
  4  0  4  6
  6  0  1  0
  8  0  0  1

Since there are more levels, it’s much harder to make out if they are related. Let’s use the chi-squared test instead.

# Chi-sq test
chisq.test(mtcars$carb, mtcars$cyl)

	Pearson's Chi-squared test

data:  mtcars$carb and mtcars$cyl
X-squared = 24.389, df = 10, p-value = 0.006632

We have a high chi-squared value and a p-value of less than the 0.05 significance level. So we reject the null hypothesis and conclude that carb and cyl have a significant relationship.

Congratulations if you got it right!

If you liked this post, you might find my latest video course ‘Introduction to R Programming’ to be quite resourceful. Happy Learning!

Related Post

  1. Missing Value Treatment
  2. R for Publication by Page Piccinini
  3. Assessing significance of slopes in regression models with interaction
  4. First steps with Non-Linear Regression in R
  5. Standard deviation vs Standard error

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

Simulating local community dynamics under ecological drift

Sun, 2016-08-14 13:00

(This article was first published on R – biologyforfun, and kindly contributed to R-bloggers)

In 2001 the book by Stephen Hubbell on the neutral theory of biodiversity was a major shift from classical community ecology. Before this book the niche-assembly framework dominated the study of community dynamics. Very briefly, under this framework local species composition is the result of the resources available at a particular site, and species presence or absence depends on species niches (as defined in Chase 2003). As a result, community assembly was seen as a deterministic process: a specific set of species should emerge from a given set of resources, and species composition should be stable as long as the system is not disturbed. Hubbell's theory, based on a dispersal-assembly framework (historically developed by MacArthur and Wilson), was a departure from this mainstream view of community dynamics: it assumes that local species composition is transient and never reaches an equilibrium. Species abundances follow a random walk under ecological drift; species continuously go locally extinct while a constant rain of immigrants from the metacommunity brings new recruits. The interested reader is kindly encouraged to read: this, that or that.

Being an R geek and after reading Chapter 4 of Hubbell's book, I decided to implement his model of local community dynamics under ecological drift. Do not expect any advanced analysis of the model behavior; Hubbell actually derives analytical solutions for this model in that chapter. This post is mostly educational, to get a glimpse into the first part of Hubbell's neutral model.

Before diving into the code it is helpful to verbally outline the main characteristics of the model:

  • The model is individual-based and neutral, assuming that all individuals are equal in their probability to die or to persist.
  • The model assumes a zero-sum dynamic: the total number of individuals (J) is constant. Imagine for example a forest; when a tree dies it allows a new individual to grow and use the free space.
  • At each time step a number of individuals (D) die and are replaced by new recruits coming either from the metacommunity (with probability m) or from the local community (with probability 1-m).
  • The species identity of a new recruit is drawn in proportion to species relative abundances, either in the metacommunity (assumed to be fixed and to follow a log-normal distribution) or in the local community (which obviously varies).

We are now equipped to immerse ourselves into R code:

#implement zero-sum dynamic with ecological drift from Hubbell theory (Chp4)
untb <- function(R=100, J=1000, D=1, m=0, Time=100, seed=as.numeric(Sys.time())){
  mat <- matrix(0, ncol=R, nrow=Time+1,
                dimnames=list(paste0("Time", 0:Time), as.character(1:R)))
  #the metacommunity SAD
  metaSAD <- rlnorm(n=R)
  metaSAD <- metaSAD/sum(metaSAD)
  #initialize the local community
  begin <- table(sample(x=1:R, size=J, replace=TRUE, prob=metaSAD))
  mat[1, attr(begin, "dimnames")[[1]]] <- begin
  #loop through time and get community composition
  for(i in 2:(Time+1)){
    newC <- helper_untb(vec=mat[(i-1), ], R=R, D=D, m=m, metaSAD=metaSAD)
    mat[i, attr(newC, "dimnames")[[1]]] <- newC
  }
  return(mat)
}

#the function that computes individual deaths and
#replacement by new recruits
helper_untb <- function(vec, R, D, m, metaSAD){
  ind <- rep(1:R, times=vec)
  #remove D individuals from the community
  ind_new <- ind[-sample(x=1:length(ind), size=D, replace=FALSE)]
  #are the new recruits from the local or the metacommunity?
  recruit <- sample(x=c("L", "M"), size=D, replace=TRUE, prob=c((1-m), m))
  #sample species ID according to the relevant SAD
  ind_recruit <- c(ind_new, sapply(recruit, function(isLocal) {
    if(isLocal=="L")
      sample(x=ind_new, size=1)
    else
      sample(1:R, size=1, prob=metaSAD)
  }))
  #return a named table of species counts (assumed ending, since untb() reads the dimnames)
  return(table(ind_recruit))
}

I always like simple theories that can be implemented with a few lines of code and do not require fancy computations. Let's put the function into action: for a closed community (m = 0) of 10 species, we track the species abundance and rank-abundance distribution dynamics:

library(reshape2) #for data manipulation I
library(plyr)     #for data manipulation II
library(ggplot2)  #for the nicest plots in the world

#look at population dynamics over time
comm1 <- untb(R=10, J=100, D=10, m=0, Time=200)

#quick and dirty first graph
comm1m <- melt(comm1)
comm1m$Time <- as.numeric(comm1m$Var1) - 1
ggplot(comm1m, aes(x=Time, y=value, color=factor(Var2))) +
  geom_path() +
  labs(y="Species abundance") +
  scale_color_discrete(name="Species ID")

#let's plot the RAD curve over time
pre_rad <- apply(comm1, 1, function(x) sort(x, decreasing=TRUE))
pre_radm <- melt(pre_rad)
pre_radm$Time <- as.numeric(pre_radm$Var2) - 1
ggplot(pre_radm, aes(x=Var1, y=value/100, color=Time, group=Time)) +
  geom_line() +
  labs(x="Species rank", y="Species relative abundance")

Here we see an important feature of the zero-sum dynamics: with no immigration the model will always (eventually) lead to the dominance of a single species. This is because without immigration there are so-called absorbing states; species abundances are attracted to these states, and once such a value is reached the abundance stays there forever. In this case there are only two absorbing states for a species' relative abundance, 0 or 1: either you go extinct or you dominate the community.
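To make the absorbing-state behavior concrete, here is a minimal check (a sketch assuming the untb() function defined above is in the workspace; the seed and the run length of 5000 steps are arbitrary choices of mine):

#a closed community run long enough should fix on a single species
comm_closed <- untb(R = 10, J = 100, D = 10, m = 0, Time = 5000, seed = 42)
final <- comm_closed[nrow(comm_closed), ]
final[final != 0] / sum(final)   #expected: a single species at relative abundance 1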

In the next step we open up the community to immigration while varying the death rate, and we track the species richness dynamics:

#apply this over a range of parameters
pars <- expand.grid(R=20, J=200, D=c(1,2,4,8,16,32,64),
                    m=c(0,0.01,0.05,0.1,0.5), Time=200)
comm_list <- alply(pars, 1, function(p) do.call(untb, as.list(p)))

#get species richness for each parameter set
rich <- llply(comm_list, function(x) apply(x, 1, function(y) length(y[y!=0])))
rich <- llply(rich, function(x) data.frame(Time=0:200, Rich=x))
rich <- rbind.fill(rich)
rich$D <- rep(pars$D, each=201)
rich$m <- rep(pars$m, each=201)

ggplot(rich, aes(x=Time, y=Rich, color=factor(D))) +
  geom_path() +
  facet_grid(. ~ m)

As we increase the probability that new recruits come from the metacommunity, the impact of increasing the number of deaths per time step vanishes. This is because the metacommunity is static in this model: even if species are wiped out by a heavy local disturbance, there will always be new immigrants coming in, maintaining species richness at fairly constant levels.

There is an R package for everything these days, and neutral theory is no exception: check out the untb package, which implements the full neutral model (i.e. with speciation and dynamic metacommunities) and also allows you to fit the model to empirical data and obtain parameter estimates.

    Filed under: Biological Stuff, R and Stat Tagged: Biodiversity, Neutral theory, R

    To leave a comment for the author, please follow the link and comment on their blog: R – biologyforfun. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Gaussian predictive process models in Stan

    Sun, 2016-08-14 05:00

    (This article was first published on Maxwell B. Joseph, and kindly contributed to R-bloggers)

    Gaussian process (GP) models are computationally demanding for large datasets.
    Much work has been done to avoid expensive matrix operations that arise in parameter estimation with larger datasets via sparse and/or reduced rank covariance matrices (Datta et al. 2016 provide a nice review).
    What follows is an implementation of a spatial Gaussian predictive process Poisson GLM in Stan, following Finley et al. 2009: Improving the performance of predictive process modeling for large datasets, with a comparison to a full rank GP model in terms of execution time and MCMC efficiency.

    Generative model

    Assume we have $n$ spatially referenced counts $y(s)$ made at spatial locations $s_1, s_2, …, s_n$, which depend on a latent mean zero Gaussian process with an isotropic stationary exponential covariance function:

    y(s) \sim \text{Poisson}(\text{exp}(X^T(s) \beta + w(s)))

    w(s) \sim GP(0, C(d))

[C(d)]_{i, j} = \eta^2 \text{exp}(- d_{ij} \phi) + I(i = j) \sigma^2

    where $y(s)$ is the response at location $s$, $X$ is an $n \times p$ design matrix, $\beta$ is a length $p$ parameter vector, $\sigma^2$ is a “nugget” parameter, $\eta^2$ is the variance parameter of the Gaussian process, $d_{ij}$ is a spatial distance between locations $s_i$ and $s_j$, and $\phi$ determines how quickly the correlation in $w$ decays as distance increases.
    For simplicity, the point locations are uniformly distributed in a 2d unit square spatial region.

    To estimate this model in a Bayesian context, we might be faced with taking a Cholesky decomposition of the $n \times n$ matrix $C(d)$ at every iteration in an MCMC algorithm, a costly operation for large $n$.
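To make the generative model concrete, here is a small R simulation of it (a sketch only; the sample size, the parameter values for $\beta$, $\eta$, $\phi$, $\sigma$, and the intercept-only design matrix are my own assumptions, not values from the post):

library(MASS)   # for mvrnorm

set.seed(1)
n <- 200
s <- cbind(runif(n), runif(n))            # point locations in the unit square
d <- as.matrix(dist(s))                   # pairwise distances

eta <- 1; phi <- 5; sigma <- 0.2; beta <- 0.5    # assumed parameter values

C <- eta^2 * exp(-d * phi) + diag(sigma^2, n)    # exponential covariance + nugget
w <- mvrnorm(1, mu = rep(0, n), Sigma = C)       # latent Gaussian process
X <- matrix(1, n, 1)                             # intercept-only design matrix
y <- rpois(n, lambda = exp(drop(X %*% beta) + w))   # spatially correlated counts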

    Gaussian predictive process representation

Computational benefits of Gaussian predictive process models arise from the estimation of the latent Gaussian process at $m \ll n$ locations (knots).
    Instead of taking the Cholesky factorization of the $n \times n$ covariance matrix, we instead factorize the $m \times m$ covariance matrix corresponding to the covariance in the latent spatial process among knots.
    Knot placement is a non-trivial topic, but for the purpose of illustration let’s place knots on a grid over the spatial region of interest.
Note that here the knots are fixed, but it is possible to model knot locations stochastically as in Guhaniyogi et al. 2011.

    Knots are shown as stars and the points are observations.

    We will replace $w$ above with an estimate of $w$ that is derived from a reduced rank representation of the latent spatial process.
    Below, the vector $\tilde{\boldsymbol{\epsilon}}$ corrects for bias (underestimation of $\eta$ and overestimation of $\sigma$) as an extension of Banerjee et al. 2008, and $\mathcal{C}^T(\theta) \mathcal{C}^{*-1}(\theta)$ relates $w$ at desired point locations to the value of the latent GP at the knot locations, where $\mathcal{C}^T(\theta)$ is an $n \times m$ matrix that gets multiplied by the inverse of the $m \times m$ covariance matrix for the latent spatial process at the knots.
For a complete and general description see Finley et al. 2009, but here is the gist of the univariate model:

\boldsymbol{Y} \sim \text{Poisson}(\text{exp}(\boldsymbol{X} \beta + \mathcal{C}^T(\theta) \mathcal{C}^{*-1}(\theta)\boldsymbol{w^*} + \tilde{\boldsymbol{\epsilon}}))

    \boldsymbol{w^*} \sim GP(0, \mathcal{C}^*(\theta))

\tilde{\epsilon}(s) \stackrel{indep}{\sim} N(0, C(s, s) - \boldsymbol{c}^T(\theta) \mathcal{C}^{*-1}(\theta) \boldsymbol{c}(\theta))

    with priors for $\eta$, $\sigma$, and $\phi$ completing a Bayesian specification.
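For intuition, here is a rough R sketch of the reduced-rank construction itself, following the post's formulas literally (not the Stan model the post fits): the knot covariance $\mathcal{C}^*$, the cross-covariance, the interpolated process, and the variance of the bias-correcting $\tilde{\epsilon}$ term. It assumes the s, n, eta, phi, sigma objects and the MASS load from the simulation sketch above, and a 5 x 5 knot grid of my own choosing:

# knots on a coarse grid over the unit square (an assumed 5 x 5 layout)
knots <- as.matrix(expand.grid(seq(0.1, 0.9, length.out = 5),
                               seq(0.1, 0.9, length.out = 5)))
m_knots <- nrow(knots)   # 25 knots

# covariance among knots (C*) and cross-covariance between points and knots (c)
C_star  <- eta^2 * exp(-as.matrix(dist(knots)) * phi) + diag(sigma^2, m_knots)
d_cross <- as.matrix(dist(rbind(s, knots)))[1:n, n + (1:m_knots)]
c_cross <- eta^2 * exp(-d_cross * phi)

# draw the latent process at the knots and interpolate it to the point locations
w_star  <- mvrnorm(1, mu = rep(0, m_knots), Sigma = C_star)
w_tilde <- c_cross %*% solve(C_star, w_star)   # the reduced-rank estimate of w

# variance of the bias-correcting epsilon-tilde term at each location
var_eps <- (eta^2 + sigma^2) - rowSums((c_cross %*% solve(C_star)) * c_cross)

Only the m x m matrix C_star ever needs to be factorized, which is the source of the computational savings.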

    This approach scales well for larger datasets relative to a full rank GP model.
    Comparing the number of effective samples per unit time for the two approaches across a range of sample sizes, the GPP executes more quickly and is more efficient for larger datasets.
    Code for this simulation is available on GitHub: https://github.com/mbjoseph/gpp-speed-test.

    Relevant papers

    Finley, Andrew O., et al. “Improving the performance of predictive process modeling for large datasets.” Computational statistics & data analysis 53.8 (2009): 2873-2884.

    Banerjee, Sudipto, et al. “Gaussian predictive process models for large spatial data sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.4 (2008): 825-848.

Datta, Abhirup, et al. “Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets.” Journal of the American Statistical Association (2016).

    Guhaniyogi, Rajarshi, et al. “Adaptive Gaussian predictive process models for large spatial datasets.” Environmetrics 22.8 (2011): 997-1007.

    To leave a comment for the author, please follow the link and comment on their blog: Maxwell B. Joseph. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Labeling Opportunities in Price Series

    Sat, 2016-08-13 22:42

    (This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

One approach to trading which has been puzzling me lately is to sit and wait for opportunities. It sounds simplistic, but it is indeed different from, for instance, asset allocation strategies. To even attempt taking advantage of these opportunities, however, we must be able to identify them. Once the opportunities are identified, we can try to explain (forecast) them using historical data.

The first step is to define what an opportunity is. It could be anything; for instance, we can start with a strong directional move, but then we need to define what a strong directional move is. Let's give it a try.

We can define a strong directional move as, say, a 5% move over 3 days. However, that's not likely to work across different markets (think futures): a 5% move is a lot in treasuries, but not so much in oil. The solution is to normalize the returns using volatility. I will use an exponential moving average for that.

    Here is a short code snippet:

library(quantmod)  # provides Hi, Lo, Cl, ROC and loads xts/TTR (EMA, runMax, runMin, lag.xts)

adjust.returns = function(returns, vola.len=35) {
   vola = sqrt(EMA(returns * returns, n = vola.len))
   return(returns/vola)
}

good.entries = function(ohlc, days.out=15, vola.len=35, days.pos=0.6) {
   stopifnot(days.out > 3)

   hi = Hi(ohlc)
   lo = Lo(ohlc)
   cl = Cl(ohlc)

   rets = ROC(cl, type='discrete')
   arets = abs(adjust.returns(rets, vola.len=vola.len))

   res = rep(0, NROW(ohlc))
   days = rep(0, NROW(ohlc))

   for(ii in 3:days.out) {
      hh = lag.xts(runMax(hi, n=ii), -ii)
      ll = lag.xts(runMin(lo, n=ii), -ii)

      hi.ratio = (hh/cl - 1)/arets
      lo.ratio = (ll/cl - 1)/arets

      # days required
      dd = ceiling(days.pos*ii)

      longs = hi.ratio > dd & -lo.ratio < dd
      longs = ifelse(longs, 1, 0)

      shorts = -lo.ratio > dd & hi.ratio < dd
      shorts = ifelse(shorts, -1, 0)

      both = ifelse(longs == 1, 1, shorts)

      new.days = ii*as.numeric(res == 0 & both != 0)

      res = ifelse(res != 0, res, both)
      days = ifelse(days != 0, days, new.days)
   }

   return(data.frame(entry=res, days=days))
}

The code basically defines an opportunity as a directional move over X days (X is between 3 and 15). The minimum move we will accept is computed by taking a percentage of the days (60% by default), rounding that number up, and using it as a threshold for the volatility-normalized move. We also require that the move in the opposite direction stays below that threshold. Hard to explain in words, but hopefully you get the idea.

    The function returns a label for each day: 0, 1, or -1. The opportunities are the 1 and -1 days of course.
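As a quick illustration of how the labels might be generated (a sketch; the ticker, date range and parameter choices are mine, and quantmod's getSymbols needs an internet connection):

spy <- getSymbols("SPY", from = "2010-01-01", auto.assign = FALSE)
labels <- good.entries(spy, days.out = 15, vola.len = 35, days.pos = 0.6)

# distribution of long (1), short (-1) and no-opportunity (0) days
table(labels$entry)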

Is this trade-able? Absolutely. When an opportunity arises, we enter in the suggested direction. The exit is a profit target; we can use a stop loss of the same magnitude.

    Why all this effort again? Simple – these labels can be used as the response, or the dependent variable in an attempt to explain (forecast) the opportunities using historical data. At this point what’s left is to generate what we think might constitute a good set of predictors, align the two sets appropriately (no data snooping) and the rest is standard machine learning.

    That’s for another post though.

    The post Labeling Opportunities in Price Series appeared first on Quintuitive.

    To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Tuning Apache Spark for faster analysis with Microsoft R Server

    Fri, 2016-08-12 15:12

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

My colleagues Max Kaznady, Jason Zhang, Arijit Tarafdar and Miguel Fierro recently posted a really useful guide with lots of tips to speed up prototyping models with Microsoft R Server on Apache Spark. These tips apply when using Spark on Azure HDInsight, where you can spin up a Spark cluster in the cloud with Microsoft R installed on the head node and worker nodes with just a single request on the Azure portal:

    The blog post provides detailed instructions, but the three main efficiency tips are:

Install RStudio Server on the head node. R Tools for Visual Studio doesn't yet support remote execution, and while you can interact with the R command line directly via SSH or Putty, having an IDE makes things much easier and faster. There's a simple guide to install RStudio Server on the head node, which you can then use remotely to drive both the head node and the worker nodes.

    Use just the head node for iterating on your model. While you probably want to use all of the data to train your final model, you can speed up the iterative process of developing your model (selecting variables, creating features and transformations, and comparing performance of different model types) by working with a sample of your data on the head node only. While this mightn't have the predictive power of using all the data, it's more than sufficient to determine variable importance and compare performance between competing models. Ultimately, the more iterations you have time for when developing the model, the better the final production model (trained on all the data) will be.

Tune the Spark cluster to optimize performance for model training. Now that you've saved time on the model development, save some time for the full-data model training by tuning the Spark cluster. (This is particularly useful if the production model is going to be retrained on a regular basis, say on a weekly or overnight schedule.) Max and Arijit provide some guidelines for setting the on-heap and off-heap memory settings for each executor to maintain the ideal ratio.

    You can read the full details at the Azure blog post linked below, and you can try tuning a sample analysis on the Taxi dataset yourself, using the R code provided on GitHub.

    Azure HDInsight blog: Rapid Big Data Prototyping with Microsoft R Server on Apache Spark: Context Switching & Spark Tuning

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Combinations Exercises

    Fri, 2016-08-12 13:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

When doing data analysis, it often happens that we have a set of values and want to obtain various possible combinations of them. For example, taking 5 random samples from a dataset of 20: how many possible 5-sample sets are there, and how do we obtain all of them? R has a bunch of functions that help with tasks like these: expand.grid, combn, outer and choose.
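If these functions are new to you, here is a tiny illustration of what each one does (made-up values, deliberately unrelated to the exercise answers):

x <- c("a", "b", "c", "d")

expand.grid(first = x, second = x)   # all ordered pairs, repetition allowed (16 rows)
combn(x, 2)                          # all unordered pairs, no repetition (one per column)
outer(1:3, 1:3, "*")                 # a 3 x 3 table of all pairwise products
choose(length(x), 2)                 # just counts the unordered pairs: 6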

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1
    You first throw a coin with 2 possible outcomes: it either lands heads or tails. Then you throw a dice with 6 possible outcomes: 1,2,3,4,5 and 6.
    Generate all possible results of your action.

    Exercise 2
    Generate a multiplication table for numbers ranging from 1 to 10.

    Exercise 3
    You have a set of card values:
    values <- c("A", 2, 3, 4, 5, 6, 7, 8, 9, 10, "J", "Q", "K")
    and a set of suits representing diamonds, clubs, spades and hearths:
    suits <- c("d", "c", "s", "h")
    Generate a deck of playing cards. (e.g. King of spades should be represented as Ks).

    Note: function paste(..., sep="") can be used to combine characters.

    Exercise 4
    Think about a game of poker using a standard deck of cards like the one generated earlier. Starting hand in poker consists of 5 cards and their order does not matter. How many different starting hands are there in total?

    Exercise 5
    You have a set of colors to choose from:
    colors <- c("red", "blue", "green", "white", "black", "yellow")
You have to pick 3 colors and you can't pick the same color more than once. List all possible combinations.

    Exercise 6
Using the same set of colors, pick 3 without picking the same color more than once, just like in the previous exercise.
    List all possible combinations but this time sort each combination alphabetically.

    Exercise 7
    You have the same choices of colors and have to pick 3 but this time you can pick the same color more than once.
    List all possible combinations.

    Exercise 8
    You have the same set of colors but this time instead of having to pick 3 you can choose to pick either 1, 2 or 3.
    How many different choices can you make?

    Exercise 9
    You have the same set of colors and you can choose to pick either 1, 2 or 3.
    Make a list of all possible choices.

    Exercise 10
There are 3 color palettes: the first one has 4 colors, the second has 6 colors and the third has 8 colors. You have to pick a palette and then choose up to 5 (1, 2, 3, 4 or 5) colors from the chosen palette. How many different possibilities are there?

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Project package libraries and reproducibility

    Fri, 2016-08-12 12:39

    (This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

    Gábor Csárdi, Consultant, Mango Solutions

    Introduction

    If you are an R user it has probably happened to you that you upgraded some R package in your R installation, and then suddenly your R script or application stopped working.

    R packages change over time, and often they are not backwards compatible. This is normal and needed for innovation. But it is also annoying for end users. What can R users do to solve this problem?

    Project package libraries

One strategy is to create a new package library for each new project. A package library is just a directory that holds installed R packages, in addition to the ones that are installed with R itself. There are various ways to configure R to use a separate package library, e.g. by setting the R_LIBS environment variable, or via the .libPaths() function. (Read more about this on the .libPaths() manual page.)
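For example, a project might carry a small .Rprofile that points R at its own library, along these lines (a sketch; the "r-lib" directory name is just an illustration):

# .Rprofile in the project directory
local({
  proj_lib <- file.path(getwd(), "r-lib")        # per-project library
  if (!dir.exists(proj_lib)) dir.create(proj_lib)
  .libPaths(c(proj_lib, .libPaths()))            # search the project library first
})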

    A separate package library for each project ensures that updating packages for one project does not affect the well-being of another project, so this is a good first step. It also allows using different versions of the same R packages in different projects.

    Once the project is done and works well in this specific environment (or the deadline is over), you need to make sure that you do not update your R packages in the project library any more, not giving any chance for your application to break.

    The pkgsnap package

    It often happens, that you need to transfer your application to another computer, e.g. from the development environment to production, or you just want to share it with another developer. This means creating a project package library in the new environment, and then installing the same versions of the same packages from CRAN.

This is surprisingly challenging using the tools in base R. R's package management tools are very robust, and great at installing the latest versions of packages reliably. But installing specific package versions, and specific versions of their dependencies, is not something they can do, at least not directly.

    This is why we created the pkgsnap tool. This is a very simple package with two exported functions:

    • snap takes a snapshot of your project library. It writes out the names and versions of the currently installed packages into a text file. You can put this text file into the version control repository of the project, to make sure it is not lost.
• restore uses the snapshot file to recreate the project package library from scratch. It installs the recorded versions of the recorded packages, in the right order.
    Example

    Here is a little demonstration of how pkgsnap works. First we create a project package library, and install pkgsnap there.

    lib_dir <- tempfile()
    dir.create(lib_dir)
    devtools::install_github("mangothecat/pkgsnap", lib = lib_dir)

    We instruct R to use this library directory. This is ideally done from a local .Rprofile file in your project directory. We also load pkgsnap.


    .libPaths(lib_dir)
    library(pkgsnap)

Currently the new package library is empty; we will install some R packages from CRAN into it. The packages they depend on will also be installed.


    install.packages(c("testthat", "pkgconfig"))
    installed.packages(lib_dir)[, c("Package", "Version")]
    #>           Package     Version
    #> crayon    "crayon"    "1.3.1"
    #> digest    "digest"    "0.6.8"
    #> memoise   "memoise"   "0.2.1"
    #> pkgconfig "pkgconfig" "2.0.0"
    #> pkgsnap   "pkgsnap"   "1.0.0"
    #> praise    "praise"    "1.0.0"
    #> testthat  "testthat"  "0.11.0"

    To create a snapshot (to a temporary file this time), you can just call snap with the name of the snapshot file.


    snapshot <- tempfile()
    snap(to = snapshot)

    The packages and their versions are now recorded in the snapshot file. To demonstrate restoration we scrap our package library, and create a new one:


    unlink(lib_dir, recursive = TRUE)
    new_lib_dir <- tempfile()
    dir.create(new_lib_dir)
    .libPaths(new_lib_dir)

    We are ready to run restore:


    restore(snapshot)
    #> Downloading
    #>   crayon_1.3.1.tgz...  done.
    #>   digest_0.6.8.tgz...  done.
    #>   memoise_0.2.1.tgz...  done.
    #>   pkgconfig_2.0.0.tgz...  done.
    #>   pkgsnap_1.0.0.tgz...   pkgsnap_1.0.0.tar.gz...   pkgsnap_1.0.0.tar.gz... ERROR.
    #>   praise_1.0.0.tgz...  done.
    #>   testthat_0.11.0.tgz...  done.
    #> Installing
    #>   digest_0.6.8.tgz ... done.
    #>   memoise_0.2.1.tgz ... done.
    #>   crayon_1.3.1.tgz ... done.
    #>   pkgconfig_2.0.0.tgz ... done.
    #>   praise_1.0.0.tgz ... done.
    #>   testthat_0.11.0.tgz ... done.

    Note that we cannot restore the pkgsnap package, because it is not on CRAN (yet). The rest of the packages were downloaded and installed:


    installed.packages(new_lib_dir)[, c("Package", "Version")]
    #>           Package     Version
    #> crayon    "crayon"    "1.3.1"
    #> digest    "digest"    "0.6.8"
    #> memoise   "memoise"   "0.2.1"
    #> pkgconfig "pkgconfig" "2.0.0"
    #> praise    "praise"    "1.0.0"
    #> testthat  "testthat"  "0.11.0"

    Other approaches

    packrat is a package dependency management system by RStudio. It is (a lot) more sophisticated and has more features than the simple approach shown here. See more at https://rstudio.github.io/packrat

MRAN is a project by Revolution Analytics, now Microsoft. It allows you to go back in time and install CRAN packages as they were current on a given date. See more at https://mran.revolutionanalytics.com/
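For comparison, the packrat workflow looks roughly like this, and an MRAN snapshot can be used as the repository to pin all installs to a date (the package and the date below are arbitrary examples of mine):

# packrat: per-project library plus a lockfile, managed for you
packrat::init()       # turn the current project into a packrat project
packrat::snapshot()   # record the installed package versions in the lockfile
packrat::restore()    # reinstall exactly those versions on another machine

# MRAN: install packages as they were on CRAN at a chosen date
install.packages("ggplot2",
                 repos = "https://mran.revolutionanalytics.com/snapshot/2016-08-01")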

    Summary

pkgsnap is an R package that helps you recreate package libraries. It has no compiled code and no dependencies, so you can install it easily. It is currently available from GitHub: https://github.com/mangothecat/pkgsnap. You can install it via the devtools package:

    devtools::install_github("mangothecat/pkgsnap")

    Give it a try if you like the idea and tell us what you think.

    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs