R bloggers
Mapping medical cannabis dispensaries: from PDF table to Google Map with R
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
by Sheri Gilley, Microsoft Senior Software Engineer
In 2014, Illinois passed into law the creation of a medical cannabis pilot program. As my son has cancer and marijuana could greatly help with some of his symptoms, we eagerly applied for a card when registration was available early in 2015. The first dispensaries were not available until November 2015. At that time there were 9 dispensaries; the PDF file with a table of dispensary names and locations provided by the Illinois Department of Health was an adequate way to find a dispensary.
In the time that dispensaries have been available, my son has been in various hospitals and facilities in and around the city of Chicago. First we were in Park Ridge, then Hyde Park, then Hinsdale, then downtown Chicago and now finally back home in Oak Park. As we moved around the city, I would use that same PDF file to locate the dispensary closest to me. The list has grown from 9 names and addresses to 40 today. With 40 entries, the PDF table format is not at all useful for showing where the dispensaries are located. The entries are listed in the order of the license issue date, making it all the more difficult to see which dispensaries might be easiest for me to visit.
So one weekend I decided to create a map of all the current locations. Keeping in mind that more dispensaries will be available in the future, I wanted to create code that would read the official list of registered dispensaries, so that updates would be easy as more entries were added.
I knew I could read the text of the file in R using pdftools, and could put the locations onto a google map using googleVis. The hardest part of the code was trying to filter out the noise included in the text and reliably get the name, address, and phone number of each dispensary into a data frame. A few handy gsub statements worked their magic and I was left with data ready for mapping.
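The post does not show the cleaning code itself, but the flavor of it can be sketched on made-up lines (the real text returned by pdftools will of course be laid out differently, so the patterns below are assumptions for illustration only):

```r
# Hypothetical raw lines, as pdftools::pdf_text() might yield after splitting
# on newlines; the actual PDF layout will differ.
raw <- c("Page 1 of 3",
         "Dispensary 23    123 Main St, Chicago, IL    (312) 555-0134",
         "Issued 11/2015")

# Keep only lines that contain a phone number, then split on runs of spaces
rows <- grep("\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}", raw, value = TRUE)
fields <- strsplit(gsub(" {2,}", "\t", rows), "\t")
df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(df) <- c("name", "address", "phone")
```

The same grep/gsub pattern scales to any number of dispensary rows, which is what makes re-running the script on an updated PDF cheap.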
I added in some geocoding to get the longitude and latitude, thanks to this tip.
Finally, after the data manipulation, the code to produce the map itself is rather straightforward:
# create id and LatLong for googleVis map
all$id <- paste(all$name, all$address, all$phone, sep='; ')
all$LatLong <- paste(result$lat, result$long, sep=":")
# Now plot with googleVis
require(googleVis)
g1 <- gvisMap(all, "LatLong", "id",
options=list(showTip=TRUE,
showLine=TRUE,
enableScrollWheel=TRUE,
mapType='normal',
width=400, height=400
))
# this opens a browser with the plot
plot(g1)
# write the code that will be used on my website
cat(g1$html$chart, file="dispensariesIL.html")
I thought the map might be useful for others in Illinois also interested in finding a dispensary, so I created a small website for the map, and included a few other tips and tricks I have learned along the way as the procurer of medicinal cannabis in my household.
Here is the resulting map, generated on August 12, 2016 (click to see the interactive version):
You can create this map yourself with this R code.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
ggtree paper published
(This article was first published on R on Guangchuang YU, and kindly contributed to Rbloggers)
Today ggtree received its 100th star on GitHub, and I found that the paper went online the same day via this tweet:
So ggtree has been published in Methods in Ecology and Evolution. It looks like it's open access. https://t.co/JlET18BSZv
— ホクソピペット (@siero5335) August 16, 2016
I am quite excited about it. Now ggtree is citable by doi:10.1111/2041-210X.12628.
In the review period, ggtree had already been cited by several papers:
1. https://doi.org/10.7717/peerj.976
Annotating phylogenetic tree with image files (i.e. secondary structure).
2. http://dx.doi.org/10.1016/j.meegid.2015.12.006
Highlight clades and annotate phylogenetic tree with genotype table.
3. http://dx.doi.org/10.3389%2Ffcimb.2016.00036
Annotate phylogenetic tree with binary matrix.
4. http://dx.doi.org/10.1104/pp.16.00359
Annotate phylogenetic tree with sequence features.
5. http://dx.doi.org/10.1101/067207
Annotate with bootstrap values.
6. http://www.nature.com/articles/srep30158
Visualize bootstrap values inferred by RAxML and PhyML on the same tree.
7. http://dx.doi.org/10.1101/053371
8. http://dx.doi.org/10.1105/tpc.16.00053
9. http://dx.doi.org/10.1111/mec.13474
10. http://dx.doi.org/10.1371/journal.pone.0160958
11. http://dx.doi.org/10.1111/2041-210X.12593
12. http://dx.doi.org/10.1101/053371
The ggtree paper provides several comprehensive and reproducible examples. It can be a good reference for complex tree annotation. My slides from the 9th China R conference summarize the functionality that ggtree provides.
More information can be found in ggtree homepage. Hope you have fun with ggtree.
To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.
14 new R jobs from around the world (2016-08-16)
Here are the new R jobs for 2016-08-16.
To post your R job on the next post
Just visit this link and post a new R job to the R community. You can either post a job for free (which works great), or pay $50 to have your job featured (and get extra exposure).
Current R jobs
Job seekers: please follow the links below to learn more and apply for your R job of interest:
New Featured Jobs

 Full-Time
 Assistant/Associate Professor in Innovative Qualitative Methodologies – 0035392016 Cluster Hire
Purdue University – Posted by Purdueclusterhiresearch – West Lafayette
Indiana, United States – 11 Aug 2016

 Full-Time
 Assistant/Associate Professor in Innovative Quantitative Methodologies – 0035382016 Cluster Hire
Purdue University – Posted by Purdueclusterhiresearch – West Lafayette
Indiana, United States – 11 Aug 2016

 Part-Time
 Shiny developer
ironAPI – Posted by Babak – Wien
Wien, Austria – 8 Aug 2016

 Full-Time
 Data Science Faculty (Assistant/Associate Professor, Tenure Track)
Saint Louis University – Posted by armbrees – St. Louis
Missouri, United States – 7 Aug 2016

 Full-Time
 Data Scientist / Bioinformatician / Statistician at University of California
UCSF – Posted by clif.duhn – San Francisco
California, United States – 2 Aug 2016

 Freelance
 Big Data in Digital Health (5-10 hours per week)
MedStar Institute for Innovation – Posted by Praxiteles – Anywhere
– 1 Aug 2016

 Full-Time
 Data Scientist – Analytics
Booking.com – Posted by work_at_booking – Anywhere
– 1 Aug 2016

 Freelance
 Instructor for teaching a 5-day Course on data manipulation and visualization in R @ Berlin
Physaliacourses – Posted by Physaliacourses – Berlin
Berlin, Germany – 13 Aug 2016

 Full-Time
 R Package Developer for The Center for Tropical Forest Science @ Washington, U.S.
Center for Tropical Forest Science – Forest Global Earth Observatory (CTFS-ForestGEO) – Posted by krizell – Washington
District of Columbia, United States – 12 Aug 2016

 Full-Time
 Assistant/Associate Professor in Innovative Qualitative Methodologies – 0035392016 Cluster Hire
Purdue University – Posted by Purdueclusterhiresearch – West Lafayette
Indiana, United States – 11 Aug 2016

 Full-Time
 Assistant/Associate Professor in Innovative Quantitative Methodologies – 0035382016 Cluster Hire
Purdue University – Posted by Purdueclusterhiresearch – West Lafayette
Indiana, United States – 11 Aug 2016

 Full-Time
 Senior Finance Manager at Amazon @ Seattle, Washington, United States
Amazon – Posted by Ryan – Seattle
Washington, United States – 9 Aug 2016

 Part-Time
 Shiny developer
ironAPI – Posted by Babak – Wien
Wien, Austria – 8 Aug 2016

 Full-Time
 Data Science Faculty (Assistant/Associate Professor, Tenure Track)
Saint Louis University – Posted by armbrees – St. Louis
Missouri, United States – 7 Aug 2016

 Full-Time
 Data Scientists for IDATA SAS @ Medellín, Antioquia, Colombia
IDATA SAS – Posted by vmhoyos – Medellín
Antioquia, Colombia – 5 Aug 2016

 Full-Time
 R Software Developer for University of Maryland Libraries
University of Maryland Libraries – Posted by dywong – College Park
Maryland, United States – 4 Aug 2016

 Full-Time
 Data Visualiser
CensorNet – Posted by Stephanie Locke – Emersons Green
England, United Kingdom – 4 Aug 2016

 Full-Time
 Data Scientist/Modeling – Applied Technology
Crowe Horwath – Posted by Neil.Schneider – Indianapolis
Indiana, United States – 3 Aug 2016

 Full-Time
 Sr. Analyst – Client Insights/Design
Maritz Motivation Solutions – Posted by maritzrecruiter – Fenton
Missouri, United States – 2 Aug 2016

 Full-Time
 Data Scientist / Bioinformatician / Statistician at University of California
UCSF – Posted by clif.duhn – San Francisco
California, United States – 2 Aug 2016

 Freelance
 Big Data in Digital Health (5-10 hours per week)
MedStar Institute for Innovation – Posted by Praxiteles – Anywhere
– 1 Aug 2016
On R-users.com you can see all the R jobs that are currently available.
R-users Resumes
R-users also has a resume section which features CVs from over 200 R users. You can submit your resume (as a "job seeker") or browse the resumes for free.
(you may also look at previous R jobs posts).
The Win-Vector parallel computing in R series
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details.
For your convenience here they are in order:
 A gentle introduction to parallel computing in R
 Running R jobs quickly on many machines
 Can you nest parallel operations in R?
Please check it out, and please do Tweet/share these tutorials.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
RQGIS 0.1.0 release
(This article was first published on R – jannesm, and kindly contributed to Rbloggers)
We proudly announce the release of RQGIS! RQGIS establishes an interface between R and QGIS, i.e. it allows the user to access the more than 1000 QGIS geoalgorithms from within R. To install it, run:
install.packages("RQGIS")

Naturally, you need to install other software (such as QGIS) to run RQGIS properly. Therefore, we wrote a package vignette, which guides you through the installation process of QGIS, GRASS and SAGA on various platforms. To access it, run:
vignette("install_guide", package = "RQGIS")

How to use RQGIS
To introduce RQGIS, we will demonstrate how to calculate the SAGA topographic wetness index. The first step is the creation of a QGIS environment. This is a list containing the paths RQGIS needs to access QGIS. Luckily for us, set_env does this for us. We merely need to specify the root path to the QGIS installation. This is most likely something like C:/OSGeo4W~1 on Windows, /usr on Linux and /usr/local/Cellar on a Mac. If we do not specify a path to the QGIS root folder, set_env tries to figure it out. This, however, might be time-consuming depending on the size of the drives to search.
library("RQGIS")
env <- set_env()

Secondly, we need to find out the name of the function in QGIS which calculates the wetness index:
find_algorithms("wetness index", name_only = TRUE, qgis_env = env)
## [1] "taudem:topographicwetnessindex"  "saga:sagawetnessindex"
## [3] "saga:topographicwetnessindextwi" ""

There are three algorithms containing the words wetness and index in their short description. Here, we choose saga:sagawetnessindex. To retrieve the corresponding function arguments, we use get_args_man. By setting options to TRUE, we indicate that we would like to use the default values, if available:
args <- get_args_man(alg = "saga:sagawetnessindex", options = TRUE, qgis_env = env)
# print the retrieved list
args
## $DEM
## [1] "None"
##
## $SUCTION
## [1] "10.0"
##
## $AREA_TYPE
## [1] "0"
##
## $SLOPE_TYPE
## [1] "0"
##
## $SLOPE_MIN
## [1] "0.0"
##
## $SLOPE_OFF
## [1] "0.1"
##
## $SLOPE_WEIGHT
## [1] "1.0"
##
## $AREA
## [1] "None"
##
## $SLOPE
## [1] "None"
##
## $AREA_MOD
## [1] "None"
##
## $TWI
## [1] "None"

Of course, we need to specify certain function arguments such as the input (DEM) and output (TWI) arguments. Please note that RQGIS accepts as input argument either the path to a spatial object or a spatial object residing in R. Here, we will use a digital elevation model, which comes with the RQGIS package:
# load data into R
data("dem", package = "RQGIS")
# define input
args$DEM <- dem
# specify output path
args$TWI <- "twi.asc"

Finally, we can access QGIS from within R by supplying run_qgis with the specified arguments as a list. Specifying also the load_output argument directly loads the QGIS output into R.
twi <- run_qgis(alg = "saga:sagawetnessindex", params = args,
                load_output = args$TWI, qgis_env = env)
# visualize the result
library("raster")
hs <- hillShade(terrain(dem), terrain(dem, "aspect"), 40, 270)
pal <- RColorBrewer::brewer.pal(6, "Blues")
spplot(twi, col.regions = pal, alpha.regions = 0.8,
       scales = list(tck = c(1, 0)),
       colorkey = list(space = "bottom", width = 1, height = 0.5,
                       axis.line = list(col = "black")),
       cuts = length(pal) - 1) +
  latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)), under = TRUE)
For more information on RQGIS, please refer to https://github.com/jannesm/RQGIS.
To leave a comment for the author, please follow the link and comment on their blog: R – jannesm.
15 Page Tutorial for R
(This article was first published on R – StudyTrails, and kindly contributed to Rbloggers)
For beginners in R, here is a 15-page, example-based tutorial that covers the basics of R.
 Starting R – Trivial tutorial on how to start R for those just wondering what to do next after downloading R.
 Assignment Operator – Two important assignment operators in R are <- and =
 Listing Objects – All entities in R are called objects. They can be arrays, numbers, strings, functions. This tutorial will cover topics such as listing all objects, listing object from a specific environment and listing objects that satisfy a particular pattern.
 Sourcing R File – R code can also be written in a file, and that file can then be sourced from other R code.
 Basic Datastructures in R – Understanding data structures is probably the most important part of learning R. This tutorial covers vector and list. It also covers subsetting.
 Data Structures in R, Matrix and Array – Covers matrices and arrays. An array is a vector with additional attributes: dim, which stores the dimension of the array, and dimnames, which stores the names of the dimensions. A matrix is a 2-dimensional array. Head to the tutorial for examples of both.
 Data Structures in R, factors and Data Frame – DataFrames are probably the most widely used data structure. It would help to just go through the examples and practice them. The tutorial covers important operations on the data frame and factors as well as subsetting data frames.
 Data Structures in R, Data Frame Operations – Covers some more operations on the data frame; including stack, attach, with, within, transform, subset, reshape and merge
 Control Structures in R – The basics of any programming language. Control loops allow looping through data structures. The tutorial covers if, ifelse, for, while, next, break, repeat and switch
 Control Structures in R – apply – To make looping more efficient R has introduced a family of ‘apply’ functions. For example – the apply function can be used to apply a function over specific elements of an array (or matrix). The tutorial covers lapply, sapply, apply, tapply.
 Control Structures in R – apply 2 – We continue with some more apply functions – mapply and by.
 Functions in R – The nuts and bolts of any programming language. This tutorial not only explains the concept of functions using examples but also covers various scenarios such as anonymous functions or passing functions around.
 Printing on Console in R – Printing on console can come very handy. The tutorial covers the print and cat functions as well as printing data frames.
 Pretty printing using Format function in R – This tutorial looks at how to use the formatting functions for pretty printing.
 Reshape and Reshape2 Package – Once you start working on real life problems in R, a lot of time would be spent on manipulating data. Reshape and Reshape2 package will prove very powerful in converting data to the format required by other libraries. This tutorial has detailed examples to explain the package.
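As a quick taste of the apply family covered in those tutorials (a toy example of my own, not taken from StudyTrails):

```r
df <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 10))

# sapply: apply a function to each element and simplify the result
sq <- sapply(df$value, function(x) x^2)    # 1 4 100

# tapply: apply a function within groups defined by a factor
means <- tapply(df$value, df$group, mean)  # a = 1.5, b = 10
```

Both replace an explicit for loop over rows or groups with a single vectorized call, which is the theme running through the apply tutorials above.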
These tutorials are designed for beginners in R, but they can also be used by experienced programmers as a refresher course or as reference. Running loops in R can be slow and therefore the apply group of functions as well as the reshape package can drastically improve the performance of the code.
We hope you enjoy the tutorials.
To leave a comment for the author, please follow the link and comment on their blog: R – StudyTrails.
tidyr 0.6.0
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
I’m pleased to announce tidyr 0.6.0. tidyr makes it easy to “tidy” your data, storing it in a consistent form so that it’s easy to manipulate, visualise and model. Tidy data has a simple convention: put variables in the columns and observations in the rows. You can learn more about it in the tidy data vignette. Install it with:
install.packages("tidyr")

I mostly released this version to bundle up a number of small tweaks needed for R for Data Science. But there’s one nice new feature, contributed by Jan Schulz: drop_na(). drop_na() drops rows containing missing values:
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df
#> # A tibble: 3 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>
#> 3    NA     b

# Called without arguments, it drops rows containing
# missing values in any variable:
df %>% drop_na()
#> # A tibble: 1 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a

# Or you can restrict the variables it looks at,
# using select() style syntax:
df %>% drop_na(x)
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2  <NA>

Please see the release notes for a complete list of changes.
To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.
Probably the most useful R function I’ve ever written
(This article was first published on R language – Burns Statistics, and kindly contributed to Rbloggers)
The function in question is scriptSearch. I’m not much for superlatives — “most” and “best” imply one dimension, but we live in a multidimensional world. I’m making an exception.
The statistic I have in mind for this use of “useful” is the waiting time between calls to the function divided by the human time saved by the call.
I wrote a version of this for a company where I do consulting. There are few days working there that I don’t have at least one bout with it. Using scriptSearch can easily save half an hour compared to what I would have done prior to having the function.
scriptSearch
The two main inputs are:
 a string to search for
 a directory to search in
By default it only looks in R scripts in the directory (and its subdirectories).
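The BurStMisc function has more options, but the core behavior can be approximated in a few lines of base R (my own sketch, not the package code; the name searchScripts is made up to avoid clashing with the real function):

```r
# Minimal sketch of a scriptSearch-like helper: find every line in every
# R script under a directory that matches a regular expression.
searchScripts <- function(pattern, dir, recursive = TRUE) {
  files <- list.files(dir, pattern = "\\.[rR]$",
                      recursive = recursive, full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    if (length(idx)) data.frame(file = f, line = idx,
                                text = lines[idx],
                                stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)   # one row per matching line, or NULL if none
}
```

Calling searchScripts("blimblam *<-", "path/to/scriptFarm") would then report the file and line number of each match, which is the workflow described below.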
Examples of directories to search are:
 directory holding a large collection of R scripts
 directory holding the source for local R packages
 personal directory with lots of subdirectories containing R scripts and functions
Examples of uses are:
 where is blimblam defined?
 where are all the uses of splishsplash in the local packages (because I want to change its arguments)?
 a few weeks ago I created a pdf called factor_history, where is the code that produced that?
These uses might be done with something like:
 scriptSearch("blimblam *<-", "path/to/scriptFarm", sub=FALSE)
 scriptSearch("splishsplash", "path/to/Rsource")
 scriptSearch("factor_history", "..")
You may be confused by the asterisk in the first call. The string to search for can be a regular expression. In this case the asterisk means zero or more spaces, so the pattern will find assignments whether or not there is a space between the object name and the assignment arrow.
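For example (my own illustration), the pattern matches the assignment with or without a space before the arrow, but not a mere use of the name:

```r
lines <- c("blimblam <- 37",   # assignment with a space
           "blimblam<-37",     # assignment without a space
           "foo <- blimblam")  # a use, not an assignment
grepl("blimblam *<-", lines)
#> [1]  TRUE  TRUE FALSE
```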
BurStMisc
scriptSearch was the main motivation for updating the BurStMisc package to version 1.1. The package is on CRAN.
ntile
The ntile function is also new to BurStMisc. It returns equally-sized ordered groups from a numeric vector, for example quintiles or deciles.
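The idea can be sketched in base R (a minimal version of my own, not the BurStMisc implementation, which handles more edge cases such as ties and NAs):

```r
# Assign each value to one of n equally sized ordered groups
ntileSketch <- function(x, n) {
  ceiling(rank(x, ties.method = "first") * n / length(x))
}

ntileSketch(c(5, 1, 9, 3, 7), 5)   # one observation per quintile: 3 1 5 2 4
```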
A more primitive version of the function appeared in a blog post called “Miles of iles”. There is some discussion there of alternative functions.
writeExpectTest
While I was preparing the update to BurStMisc, I found that automating the writing of some tests using the testthat package was both warranted and feasible. The writeExpectTest function is the result.
corner
The generally useful function that was already in BurStMisc is corner. It allows you to see a few rows and columns of a large matrix, data frame, or higher-dimensional array.
Epilogue
I want to spread the news
That if it feels this good getting used
You just keep on using me
— from “Use Me” by Bill Withers
The post Probably the most useful R function I’ve ever written appeared first on Burns Statistics.
To leave a comment for the author, please follow the link and comment on their blog: R language – Burns Statistics.
The inexorable growth of student debt, charted with R
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Len Kiefer, Deputy Chief Economist at Freddie Mac, recently published the following chart to his personal blog showing household debt in the United States (excluding mortgage debt). As you can see, student loan debt has steadily increased over the last 13 years and has now eclipsed all other forms of nonmortgage debt:
He also created this animated chart showing the growth (and occasional decline) of all forms of debt (including mortgages). All categories are scaled to the same nominal value in March 2003, and since that time student debt in the US has more than quintupled.
Both charts were created using the R language (the code is included in the blog post linked below). The data come from the New York Federal Reserve, which Len reads into R using fread from the data.table package. The line charts were created using the ggplot2 package, with the ggrepel extension to keep the labels from overlapping. The animated version was created using the saveGIF function of the animation package.
For more charts (including some interesting by-state charts) and the complete details on the implementation, follow the link to Len's blog below.
Len Kiefer: Consumer Credit Trends (via Sharon Machlis)
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Can you nest parallel operations in R?
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
When we teach parallel programming in R we start with the basic use of parallel (please see here for example). This is, in our opinion, a necessary step before getting into clever notation and wrapping such as doParallel and foreach. Only then do the students have a sufficiently explicit interface to frame important questions about the semantics of parallel computing. Beginners really need a solid mental model of what services are really being provided by their tools and to test edge cases early.
One question that comes up over and over again is “can you nest parLapply?”
The answer is “no.” This is in fact an advanced topic, but it is one of the things that pops up when you start worrying about parallel programming. Please read on for why that is the right answer and how to work around it (simulate a “yes”).
I don’t think the above question is usually given sufficient consideration (nesting parallel operations can in fact make a lot of sense). You can’t directly nest parLapply, but that is a different issue than whether one can invent a workaround. For example: a “yes” answer (really meaning there are workarounds) can be found here. Again this is a different question than “is there a way to nest foreach loops” (which is possible through the nesting operator %:%, which presumably handles working around the nesting issues in parLapply).
Let’s set up a concrete example, so we can ask and answer a precise question. Suppose we have a list of jobs (coming from an external source) that we will simulate with the code fragment below.
jobs <- list(1:3, 5:10, 100:200)

Notice the jobs have wildly diverging sizes; this is an important consideration.
Suppose the task we want to perform is to sum the square roots of the entries. The standard (non-parallel) calculation would look like the following.
worker1 <- function(x) { sum(sqrt(x)) }
lapply(jobs, worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

For didactic purposes please pretend that the sum function is very expensive and the sqrt function is somewhat expensive.
If it was obvious we always had a great number of small sublists we would want to use parallelization to make sure we are performing many sums at the same time. We would then parallelize over the first level as below.
clus <- parallel::makeCluster(4)
parallel::parLapplyLB(clus, jobs, worker1)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

Notice that parallel::parLapplyLB uses almost the same calling convention as lapply and returns the exact same answer.
If it was obvious we had a single large sublist we would want to make sure we were always parallelizing the sqrt operations so we would prefer to parallelize as follows:
mkWorker2 <- function(clus) {
  force(clus)
  function(x) {
    xs <- parallel::parLapplyLB(clus, x, sqrt)
    sum(as.numeric(xs))
  }
}
worker2 <- mkWorker2(clus)
lapply(jobs, worker2)
## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

(For the details of building functions and passing values to remote workers please see here.)
If we were not sure what structure we would encounter in the future, we would prefer to schedule all operations for possible parallel execution. This would minimize the number of idle resources and minimize the time to finish the jobs. Ideally that would look like the following (a nested use of parallel):
parallel::parLapplyLB(clus, jobs, worker2)
## Error in checkForRemoteErrors(val): 3 nodes produced errors; first error: invalid connection

Notice the above fails with an error. Wishing for flexible code is what beginners intuitively mean when they ask if you can nest parallel calls. They may not be able to explain it, but they are worried they don’t have a good characterization of the work they are trying to parallelize over. They are not asking if things get magically faster by “parallelizing parallel.”
It isn’t too hard to find out what the nature of the error is: the communication connection socket file descriptors (con) are passed as integers to each machine, but they are not valid descriptors where they arrive (they are just integers). We can see this by looking at the structure of the cluster:
str(clus)
## List of 4
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 5
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 1
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 6
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 2
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 7
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 3
##   ..- attr(*, "class")= chr "SOCKnode"
##  $ :List of 3
##   ..$ con :Classes 'sockconn', 'connection'  atomic [1:1] 8
##   .. .. ..- attr(*, "conn_id")=<externalptr>
##   ..$ host: chr "localhost"
##   ..$ rank: int 4
##   ..- attr(*, "class")= chr "SOCKnode"
##  - attr(*, "class")= chr [1:2] "SOCKcluster" "cluster"

mkWorker3 <- function(clus) {
  force(clus)
  function(x) {
    as.character(clus)
  }
}
worker3 <- mkWorker3(clus)
parallel::parLapplyLB(clus, jobs, worker3)
## [[1]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[2]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"
##
## [[3]]
## [1] "list(con = 5, host = \"localhost\", rank = 1)"
## [2] "list(con = 6, host = \"localhost\", rank = 2)"
## [3] "list(con = 7, host = \"localhost\", rank = 3)"
## [4] "list(con = 8, host = \"localhost\", rank = 4)"

What we are getting wrong is: we can’t share control of the cluster with each worker just by passing the cluster object around.
This would require some central registry and callback scheme (which is one of the things packages like foreach and doParallel accomplish when they “register a parallel backend to use”). Base parallel depends more on explicit reference to the cluster data structure, so it isn’t “idiomatic parLapply” to assume we can find “the parallel cluster” (there could in fact be more than one at the same time).
So what is the work around?
One work around is to move to sophisticated wrappers (like doParallel or even future, also see here).
Another fix is to reorganize the calculation (at the cost of time, effort and space) into two sequenced phases, each of which is parallel but not nested. It is a bit involved, but we show how to do that below (using R’s Reduce and split functions to reorganize the data, though one could also use so-called “tidyverse” methods).
# Preparation 1: collect all items into one flat list
sqrtjobs <- as.list(Reduce(c, jobs))
# Phase 1: sqrt every item in parallel
sqrts <- parallel::parLapplyLB(clus, sqrtjobs, sqrt)
# Preparation 2: reassemble a new job list that needs only sums
lengths <- vapply(jobs, length, numeric(1))
pattern <- lapply(seq_len(length(lengths)), function(i) { rep(i, lengths[[i]]) })
pattern <- Reduce(c, pattern)
sumjobs <- split(sqrts, pattern)
sumjobs <- lapply(sumjobs, as.numeric)
names(sumjobs) <- names(jobs)
# Phase 2: sum all items in parallel
parallel::parLapplyLB(clus, sumjobs, sum)

## [[1]]
## [1] 4.146264
##
## [[2]]
## [1] 16.32201
##
## [[3]]
## [1] 1231.021

In conclusion: you can’t directly nest parLapply, but you can usefully sequence through it.
parallel::stopCluster(clus)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Dates and Times – Simple and Easy with lubridate exercises (part 1)
(This article was first published on Rexercises, and kindly contributed to Rbloggers)
As in any programming language, handling date and time variables can be quite frustrating, since, for example, there is no one single format for dates, there are different time zones and there are issues such as daylight saving time.
Base R provides several ways of handling date and time variables, but they require mastering cumbersome syntax.
In order to solve all those issues and more, the R package “lubridate” was created. On one hand it has a very easy and intuitive syntax, and on the other hand it has functions covering a wide range of issues related to dates and times.
In this first part in the series of lubridate exercises, the basic functionality of the package is covered.
As always, let’s start by downloading and installing the package:
install.packages("lubridate")
library(lubridate)
Answers to the exercises are available here.
If you have different solutions, feel free to post them.
Parsing Datetimes
The ymd() series of functions are used to parse character strings into dates.
The letters y, m, and d correspond to the year, month, and day elements of a datetime.
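As a quick reference point before the exercises (standard lubridate calls; the dates themselves are arbitrary):

```r
library(lubridate)

# ymd() is forgiving about separators: these all parse to the same Date.
d <- ymd("2017-01-23")
ymd("20170123")            # same date
dmy("23-01-2017")          # dmy() for day-month-year ordering

# Accessor functions pull out the individual components.
year(d)    # 2017
month(d)   # 1
day(d)     # 23
```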
Exercise 1
Populate a variable called “start_date” with a date representation of string “23012017”
Exercise 2
Use the lubridate function today to print the current date
Exercise 3
Extract the year part from the “start_date” variable created on exercise 1
Exercise 4
Extract the month part from the “start_date” variable created on exercise 1
Exercise 5
Extract the day part from the “start_date” variable created on exercise 1
Exercise 6
Set the month in variable “start_date” to February
Exercise 7
Add 6 days to variable “start_date”.
Did you notice what happened to the month value?
Exercise 8
Subtract 3 months from variable “start_date”
Exercise 9 (Advanced)
Populate a field called concatenated_dates with a vector of dates containing the following values:
“31.12.2015”, “01.01.2016”, “15.02.2016”
Exercise 10 (Advanced)
Calculate in a short and simple way the addition of 1 through 10 days to the “start_date” variable
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
colourpicker: A colour picker widget for Shiny apps, RStudio, Rmarkdown, and ‘htmlwidgets’
(This article was first published on Dean Attali's R Blog, and kindly contributed to Rbloggers)
Have you ever wanted to allow your users to select colours in your Shiny apps? Have you ever wanted to select a few colours to use in your R code, but found it tedious to search for the right colours? If you answered yes to any of those questions, or if you’re just curious, then colourpicker is the package for you!
The new colourpicker package gives you a colour picker widget that can be used in different contexts in R. Most of the functionality has existed in the shinyjs package for the past year and this package is simply a way to graduate all the colour picker functions into their own package.
Demos

Click here to view a live interactive demo of the colour picker input available for Shiny apps. Click here to see the colour picker addin that lets you select colours interactively.
colourpicker is available through both CRAN and GitHub:
To install the stable CRAN version:
install.packages("colourpicker")

To install the latest development version from GitHub:
install.packages("devtools")
devtools::install_github("daattali/colourpicker")

Overview of main functions

colourpicker exposes three functions: colourInput() (for Shiny apps), colourPicker() (RStudio addin), and colourWidget() (an htmlwidget).
In Shiny apps (or R markdown): colourInput()

You can use colourInput() to include a colour picker input in Shiny apps (or in R markdown documents). It works just like any other native Shiny input; here is an example:
library(shiny)

shinyApp(
  ui = fluidPage(
    colourInput("col", "Select colour", "purple"),
    plotOutput("plot")
  ),
  server = function(input, output) {
    output$plot <- renderPlot({
      set.seed(1)
      plot(rnorm(50), bg = input$col, col = input$col, pch = 21)
    })
  }
)

To select colours to use in your R code: colourPicker()

colourpicker also provides an RStudio addin that can be used to easily select colours and save them as a variable in R. This can be useful if, for example, you want to pick some colours for a plot and you want an easy way to visualize and select a few colours. Here is a screenshot of the colour picker addin (you can either access this tool using the Addins menu or with colourPicker()). You can also watch a short GIF of it in action.
As an ‘htmlwidgets’ widget

The colour picker input is also available as an ‘htmlwidgets’ widget using the colourWidget() function. This may not be terribly useful right now since you can use the more powerful colourInput in Shiny apps and Rmarkdown documents, but it may come in handy if you need a widget.
Features of colourInput()

Simple and familiar

Using colourInput is trivial if you’ve used Shiny, and it’s as easy to use as any other input control. It was implemented to very closely mimic all other Shiny inputs so that using it will feel very familiar. You can add a simple colour input to your Shiny app with colourInput("col", "Select colour", value = "red"). The return value from a colourInput is an uppercase HEX colour, so in the previous example the value of input$col would be #FF0000 (#FF0000 is the HEX value of the colour red). The default value at initialization is white (#FFFFFF).
Allowing “transparent”

Since most functions in R that accept colours can also accept the value “transparent”, colourInput has an option to allow selecting the “transparent” colour. By default, only real colours can be selected, so you need to use the allowTransparent = TRUE parameter. When this feature is turned on, a checkbox appears inside the input box.
If the user checks the checkbox for “transparent”, then the colour input is grayed out and the returned value of the input is transparent. This is the only case when the value returned from a colourInput is not a HEX value. When the checkbox is unchecked, the value of the input will be the last selected colour prior to selecting “transparent”.
By default, the text of the checkbox reads “Transparent”, but you can change that with the transparentText parameter. For example, it might be more clear to a user to use the word “None” instead of “Transparent”. Note that even if you change the checkbox text, the return value will still be transparent since that’s the actual colour name in R.
This is what a colour input with transparency enabled looks like
How the chosen colour is shown inside the input

By default, the colour input’s background will match the selected colour and the text inside the input field will be the colour’s HEX value. If that’s too much for you, you can customize the input with the showColour parameter to either only show the text or only show the background colour.
Here is what a colour input with each of the possible values for showColour looks like
Limited colour selection

If you want to only allow the user to select a colour from a specific list of colours, rather than any possible HEX colour, you can use the palette = "limited" parameter. By default, the limited palette will contain 40 common colours, but you can supply your own list of colours using the allowedCols parameter. Here is an image of the default limited colour palette.
Updating a colourInput

As with all other Shiny inputs, colourInput can be updated with the updateColourInput function. Any parameter that can be used in colourInput can be used in updateColourInput. This means that you can start with a basic colour input such as colourInput("col", "Select colour") and completely redesign it with
updateColourInput(session, "col", label = "COLOUR:", value = "orange",
                  showColour = "background", allowTransparent = TRUE,
                  transparentText = "None")

Flexible colour specification

Specifying a colour to the colour input is made very flexible to allow for easier use. When giving a colour as the value parameter of either colourInput or updateColourInput, there are a few ways to specify a colour:
 Using a name of an R colour, such as red, gold, blue3, or any other name that R supports (for a full list of R colours, type colours())
 If transparency is allowed in the colourInput, the value transparent (lowercase) can be used. This will update the UI to check the checkbox.
 Using a 6-character HEX value, either with or without the leading #. For example, initializing a colourInput with any of the following values will all result in the colour red: ff0000, FF0000, #ff0000.
 Using a 3-character HEX value, either with or without the leading #. These values will be converted to full HEX values by automatically doubling every character. For example, all the following values would result in the same colour: 1ac, #1Ac, 11aacc.
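The doubling rule is easy to mimic in base R. Here is a hypothetical helper (not part of colourpicker, purely an illustration of the conversion just described):

```r
# Hypothetical illustration: expand a 3-character HEX colour by doubling
# each character, and normalize to the uppercase "#RRGGBB" form that
# colourInput returns.
expand_hex <- function(x) {
  x <- sub("^#", "", x)                       # drop an optional leading '#'
  if (nchar(x) == 3) {
    x <- paste(rep(strsplit(x, "")[[1]], each = 2), collapse = "")
  }
  paste0("#", toupper(x))
}

expand_hex("1ac")      # "#11AACC"
expand_hex("#ff0000")  # "#FF0000"
```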
If you’re worried that maybe someone viewing your Shiny app on a phone won’t be able to use this input properly – don’t you worry. I haven’t quite checked every single device out there, but I did spend extra time making sure the colour selection JavaScript works in most devices I could think of. colourInput will work fine in Shiny apps that are viewed on Android cell phones, iPhones, iPads, and even Internet Explorer 8+.
As usual, if you have any comments, feel free to contact me.
To leave a comment for the author, please follow the link and comment on their blog: Dean Attali's R Blog.
rfoaas 1.0.0
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
The big 1.0.0 is here! Following the unsurpassed lead of the FOAAS project, we have arrived at a milestone: Release 1.0.0 is now on CRAN.
The rfoaas package provides an interface for R to the most excellent FOAAS service–which itself provides a modern, scalable and RESTful web service for the frequent need to tell someone to f$#@ off.
Release 1.0.0 brings fourteen (!!) new access points: back(), bm(), gfy(), think(), keep(), single_(), look(), looking(), no(), give(), zero(), pulp(), sake(), and anyway(). All with documentation and testing.
Even more neatly, thanks to a very nice pull request by Tyler Hunt, we can now issue random FOs via the getRandomFO() function!
As usual, CRANberries provides a diff to the previous CRAN release. Questions, comments etc. should go to the GitHub issue tracker. More background information is on the project page as well as on the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
Chi-Squared Test
(This article was first published on DataScience+, and kindly contributed to Rbloggers)
Before we build stats/machine learning models, it is a good practice to understand which predictors are significant and have an impact on the response variable.
In this post we deal with a particular case when both your response and predictor are categorical variables.
By the end of this you’d have gained an understanding of what predictive modelling is, and what the significance and purpose of the chi-squared statistic is. We will go through a hypothetical case study to understand the math behind it. We will then implement a chi-squared test in R and learn to interpret the results. Finally, you’ll solve a mini challenge before we discuss the answers.
Background knowledge – Predictive modelling
 Background Knowledge
 Case Study – Effectiveness of a drug treatment
 Purpose and math behind the Chi-Sq statistic
 ChiSq Test
 R Code
 Mini-Challenge
For the sake of completeness, let’s begin by understanding how predictive modelling works, so you can better appreciate the significance of chi-squared tests and how they fit into the process.
Predictive modelling is a technique where we use statistical modelling or machine learning algorithms to predict one or more response variables based on one or more predictors.
The predictors are typically features that influence the response in some way. The models work best if the features are meaningful and have a significant relationship with the response.
But you normally wouldn’t know beforehand whether the response depends on a given feature or not. We can use the chi-squared test to determine if they are dependent, provided both the response and the predictor are categorical variables.
Hypothetical Example: Effectiveness of a Drug Treatment

Let’s consider a hypothetical case where we test the effectiveness of a drug for a certain medical condition.
Suppose we have 105 patients under study and 50 of them were treated with the drug. The remaining 55 patients were kept as control samples. The health condition of all patients was checked after a week.
The following table shows whether their condition improved or not. Just by looking at it, can you tell if the drug had a positive effect on the patients?
As you can see, 35 out of the 50 treated patients showed improvement. If the drug had no effect, the 50 would have split in the same proportion as the patients who were not given the treatment. But in this case, about 70% of the treated patients showed improvement, which is significantly higher than in the control group.
Since both categorical variables have only 2 levels, it was somewhat intuitive to say that the drug treatment and health condition are dependent. But as the number of categories increases, we need a quantifiable statistic to definitively say whether they are dependent or not.
One such metric is the chi-squared statistic.
Chi-Squared Statistic

For the sake of understanding, let’s see how the chi-squared statistic is computed for the 2-by-2 case.
To begin with, we will assume that the 2 variables are not related to each other. If that is the case, can you tell what the expected value of each cell would be? Each expected value is the row total times the column total, divided by the grand total. For example, the first cell takes the value 50 times 61 by 105, which equals 29.04.
All the expected values can be computed this way (shown in brackets).
Once that is done, the chi-squared statistic is computed as follows.
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
Numeric Computation
\(\chi^2 = ((35-29.04)^2 / 29.04) + ((15-20.95)^2 / 20.95) + \)
\(((26-31.95)^2 / 31.95) + ((29-23.04)^2 / 23.04) = 5.56\)
This value will be larger if the difference between the actual and expected values widens.
Also, the more categories the variables have, the larger the chi-squared statistic tends to be.
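The arithmetic above can be checked directly in R (the observed counts are the ones from the drug-treatment computation):

```r
# Observed counts: rows = treated / not treated, cols = improved / not improved
observed <- matrix(c(35, 15,
                     26, 29), nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# The chi-squared statistic from the formula above
chisq_stat <- sum((observed - expected)^2 / expected)
chisq_stat   # 5.556918
```

This matches the 5.56 computed by hand, and the 5.5569 that chisq.test() reports below.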
Chi-Squared Test

In order to establish that 2 categorical variables are dependent, the chi-squared statistic should be above a certain cutoff. This cutoff increases as the number of classes within the variables increases.
Alternatively, you can simply perform a chi-squared test and check the p-value.
Like all statistical tests, the chi-squared test has a null hypothesis and an alternate hypothesis. The general practice is: if the p-value in the result is less than a predetermined significance level, usually 0.05, then we reject the null hypothesis.
H0: The two variables are independent.
H1: The two variables are related.
The null hypothesis of the chisquared test is that the two variables are independent and the alternate hypothesis is that they are related.
R Code

Let’s work it out in R by doing a chi-squared test on the treatment (X) and improvement (Y) columns in treatment.csv.
First, read in the treatment.csv data.
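The CSV itself isn’t reproduced in this post; as a stand-in, the same 105-patient table can be rebuilt from the counts used in the computation above (the column names here are assumptions):

```r
# Rebuild the treatment data from the quoted counts:
# 35 treated/improved, 15 treated/not improved,
# 26 not-treated/improved, 29 not-treated/not improved.
df <- data.frame(
  treatment   = rep(c("treated", "not-treated"), times = c(50, 55)),
  improvement = c(rep(c("improved", "not-improved"), times = c(35, 15)),
                  rep(c("improved", "not-improved"), times = c(26, 29)))
)
table(df$treatment, df$improvement)
```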
Let’s do the chisquared test using the chisq.test() function. It takes the two vectors as the input. We also set `correct=FALSE` to turn off Yates’ continuity correction.
# Chi-sq test
chisq.test(df$treatment, df$improvement, correct=FALSE)

	Pearson's Chi-squared test

data:  df$treatment and df$improvement
X-squared = 5.5569, df = 1, p-value = 0.01841

We have a chi-squared value of 5.55. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis and conclude that the two variables are in fact dependent. Sweet!
Mini-Challenge

For this challenge, find out if the ‘cyl’ and ‘carb’ variables in the ‘mtcars’ dataset are dependent or not. Go ahead, pause and try it out. I will be showing the answers in a few seconds. Good luck!
So here is my answer:
Let’s have a look at the table of mtcars$carb vs mtcars$cyl.
Since there are more levels, it’s much harder to make out if they are related. Let’s use the chisquared test instead.
# Chi-sq test
chisq.test(mtcars$carb, mtcars$cyl)

	Pearson's Chi-squared test

data:  mtcars$carb and mtcars$cyl
X-squared = 24.389, df = 10, p-value = 0.006632

We have a high chi-squared value and a p-value less than the 0.05 significance level. So we reject the null hypothesis and conclude that carb and cyl have a significant relationship.
Congratulations if you got it right!
If you liked this post, you might find my latest video course ‘Introduction to R Programming’ to be quite resourceful. Happy Learning!
Related Post
 Missing Value Treatment
 R for Publication by Page Piccinini
 Assessing significance of slopes in regression models with interaction
 First steps with Non-Linear Regression in R
 Standard deviation vs Standard error
To leave a comment for the author, please follow the link and comment on their blog: DataScience+.
Simulating local community dynamics under ecological drift
(This article was first published on R – biologyforfun, and kindly contributed to Rbloggers)
In 2001 the book by Stephen Hubbell on the neutral theory of biodiversity marked a major shift from classical community ecology. Before this book the niche-assembly framework dominated the study of community dynamics. Very briefly, under this framework local species composition is the result of the resources available at a particular site, and species presence or absence depends on species niche (as defined in Chase 2003). As a result community assembly was seen as a deterministic process: a specific set of species should emerge from a given set of resources, and species composition should be stable as long as the system is not disturbed. Hubbell’s theory, based on a dispersal-assembly framework (historically developed by MacArthur and Wilson), was a departure from this mainstream view of community dynamics, assuming that local species composition is transient, with no equilibrium. Species abundances follow a random walk under ecological drift: species continuously go locally extinct while a constant rain of immigrants from the metacommunity brings new recruits.
Being an R geek, and after reading Chapter 4 of Hubbell’s book, I decided to implement his model of local community dynamics under ecological drift. Do not expect any advanced analysis of the model behavior; Hubbell actually derives analytical solutions for this model in that chapter. This post is mostly educational, to get a glimpse into the first part of Hubbell’s neutral model.
Before diving into the code it is helpful to verbally outline the main characteristics of the model:
 The model is individual-based and neutral, assuming that all individuals are equal in their probability to die or to persist.
 The model assumes a zero-sum dynamic: the total number of individuals (J) is constant. Imagine for example a forest: when a tree dies it frees space for a new individual to grow.
 At each time step a number of individuals (D) die and are replaced by new recruits coming either from the metacommunity (with probability m) or from the local community (with probability 1 - m).
 The species identity of a new recruit is drawn proportionally to species relative abundances either in the metacommunity (assumed to be fixed, with a lognormal distribution) or in the local community (which obviously varies).
We are now equipped to immerse ourselves into R code:
# implement the zero-sum dynamic with ecological drift from Hubbell's theory (Chapter 4)
untb <- function(R=100, J=1000, D=1, m=0, Time=100, seed=as.numeric(Sys.time())) {
  set.seed(seed)
  mat <- matrix(0, ncol=R, nrow=Time+1,
                dimnames=list(paste0("Time", 0:Time), as.character(1:R)))
  # the metacommunity SAD
  metaSAD <- rlnorm(n=R)
  metaSAD <- metaSAD/sum(metaSAD)
  # initialize the local community
  begin <- table(sample(x=1:R, size=J, replace=TRUE, prob=metaSAD))
  mat[1, attr(begin, "dimnames")[[1]]] <- begin
  # loop through time and get the community composition
  for (i in 2:(Time+1)) {
    newC <- helper_untb(vec=mat[(i-1),], R=R, D=D, m=m, metaSAD=metaSAD)
    mat[i, attr(newC, "dimnames")[[1]]] <- newC
  }
  return(mat)
}

# the function that computes individual deaths and
# replacement by new recruits
helper_untb <- function(vec, R, D, m, metaSAD) {
  ind <- rep(1:R, times=vec)
  # remove D individuals from the community
  ind_new <- ind[-sample(x=1:length(ind), size=D, replace=FALSE)]
  # are the new recruits from the local or the metacommunity?
  recruit <- sample(x=c("L","M"), size=D, replace=TRUE, prob=c((1-m), m))
  # sample species IDs according to the relevant SAD
  ind_recruit <- c(ind_new, sapply(recruit, function(isLocal) {
    if (isLocal == "L")
      sample(x=ind_new, size=1)
    else
      sample(1:R, size=1, prob=metaSAD)
  }))
  return(table(ind_recruit))
}

I always like simple theories that can be implemented with a few lines of code and that do not require fancy computations. Let’s put the function into action and get, for a closed community (m = 0) of 10 species, the species abundance and rank-abundance distribution dynamics:
library(reshape2) # for data manipulation I
library(plyr)     # for data manipulation II
library(ggplot2)  # for the nicest plots in the world

# look at population dynamics over time
comm1 <- untb(R=10, J=100, D=10, m=0, Time=200)
# quick and dirty first graphs
comm1m <- melt(comm1)
comm1m$Time <- as.numeric(comm1m$Var1) - 1
ggplot(comm1m, aes(x=Time, y=value, color=factor(Var2))) +
  geom_path() + labs(y="Species abundance") +
  scale_color_discrete(name="Species ID")
# let's plot the RAD curve over time
pre_rad <- apply(comm1, 1, function(x) sort(x, decreasing=TRUE))
pre_radm <- melt(pre_rad)
pre_radm$Time <- as.numeric(pre_radm$Var2) - 1
ggplot(pre_radm, aes(x=Var1, y=value/100, color=Time, group=Time)) +
  geom_line() + labs(x="Species rank", y="Species relative abundance")

We see there an important feature of the zero-sum dynamics: with no immigration the model will always (eventually) lead to the dominance of one species. This is because with no immigration there are so-called absorbing states; species abundances are attracted to these states and once these values are reached they stay constant. In this case there are only two absorbing states for species relative abundance, 0 or 1: either you go extinct or you dominate the community.
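The absorbing-state argument can be seen in miniature with a two-species, zero-sum random walk (a stripped-down sketch of drift, not the untb() function itself):

```r
# Two-species drift: at each step one random death and one random birth,
# both drawn in proportion to current abundance. The walk must eventually
# hit one of the two absorbing states, 0 or J.
set.seed(1)
J  <- 100   # community size (zero-sum: fixed)
n1 <- 50    # abundance of species 1
while (n1 > 0 && n1 < J) {
  n1 <- n1 - rbinom(1, 1, n1 / J) + rbinom(1, 1, n1 / J)
}
n1   # either 0 or J
```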
In the next step we open up the community while varying the death rate and we track species richness dynamic:
# apply this over a range of parameters
pars <- expand.grid(R=20, J=200, D=c(1,2,4,8,16,32,64),
                    m=c(0,0.01,0.05,0.1,0.5), Time=200)
comm_list <- alply(pars, 1, function(p) do.call(untb, as.list(p)))
# get species richness for each parameter set
rich <- llply(comm_list, function(x) apply(x, 1, function(y) length(y[y!=0])))
rich <- llply(rich, function(x) data.frame(Time=0:200, Rich=x))
rich <- rbind.fill(rich)
rich$D <- rep(pars$D, each=201)
rich$m <- rep(pars$m, each=201)
ggplot(rich, aes(x=Time, y=Rich, color=factor(D))) + geom_path() +
  facet_grid(.~m)

As we increase the probability that new recruits come from the metacommunity, the impact of increasing the number of deaths per time step becomes null. This is because the metacommunity is static in this model: even if species are wiped out by an important local disturbance, there will always be new immigrants coming in, maintaining species richness at rather constant levels.
There is an R package for everything these days, and the neutral theory is no exception: check out the untb package, which implements the full neutral model (i.e., with speciation and dynamic metacommunities) and also allows fitting the model to empirical data to obtain parameter estimates.
Filed under: Biological Stuff, R and Stat Tagged: Biodiversity, Neutral theory, R
To leave a comment for the author, please follow the link and comment on their blog: R – biologyforfun.
Gaussian predictive process models in Stan
(This article was first published on Maxwell B. Joseph, and kindly contributed to Rbloggers)
Gaussian process (GP) models are computationally demanding for large datasets.
Much work has been done to avoid expensive matrix operations that arise in parameter estimation with larger datasets via sparse and/or reduced rank covariance matrices (Datta et al. 2016 provide a nice review).
What follows is an implementation of a spatial Gaussian predictive process Poisson GLM in Stan, following Finley et al. 2009: Improving the performance of predictive process modeling for large datasets, with a comparison to a full rank GP model in terms of execution time and MCMC efficiency.
Assume we have $n$ spatially referenced counts $y(s)$ made at spatial locations $s_1, s_2, …, s_n$, which depend on a latent mean zero Gaussian process with an isotropic stationary exponential covariance function:
y(s) \sim \text{Poisson}(\text{exp}(X^T(s) \beta + w(s)))
w(s) \sim GP(0, C(d))
[C(d)]_{i, j} = \eta^2 \text{exp}(-d_{ij} \phi) + I(i = j) \sigma^2
where $y(s)$ is the response at location $s$, $X$ is an $n \times p$ design matrix, $\beta$ is a length $p$ parameter vector, $\sigma^2$ is a “nugget” parameter, $\eta^2$ is the variance parameter of the Gaussian process, $d_{ij}$ is a spatial distance between locations $s_i$ and $s_j$, and $\phi$ determines how quickly the correlation in $w$ decays as distance increases.
For simplicity, the point locations are uniformly distributed in a 2d unit square spatial region.
To estimate this model in a Bayesian context, we might be faced with taking a Cholesky decomposition of the $n \times n$ matrix $C(d)$ at every iteration in an MCMC algorithm, a costly operation for large $n$.
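As a concrete reference point, the covariance function above can be built densely in a few lines (a sketch with arbitrary parameter values, not the Stan model itself):

```r
# Dense exponential covariance matrix C(d) for points in the unit square:
# eta2 * exp(-d * phi), plus the nugget sigma2 on the diagonal.
exp_cov <- function(coords, eta2, phi, sigma2) {
  d <- as.matrix(dist(coords))                 # pairwise distances
  eta2 * exp(-d * phi) + diag(sigma2, nrow(coords))
}

set.seed(1)
coords <- matrix(runif(20), ncol = 2)          # n = 10 locations
C <- exp_cov(coords, eta2 = 1, phi = 5, sigma2 = 0.1)

# This n x n matrix is what a full-rank GP must Cholesky-factor at every
# MCMC iteration; the predictive process factors only the m x m version
# evaluated at the knots.
L <- chol(C)
```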
Gaussian predictive process representation

Computational benefits of Gaussian predictive process models arise from estimating the latent Gaussian process at $m \ll n$ locations (knots).
Instead of taking the Cholesky factorization of the $n \times n$ covariance matrix, we instead factorize the $m \times m$ covariance matrix corresponding to the covariance in the latent spatial process among knots.
Knot placement is a nontrivial topic, but for the purpose of illustration let’s place knots on a grid over the spatial region of interest.
Note that here the knots are fixed, but it is possible to model knot locations stochastically as in Guhaniyogi et al. 2012.
Knots are shown as stars and the points are observations.
We will replace $w$ above with an estimate of $w$ that is derived from a reduced rank representation of the latent spatial process.
Below, the vector $\tilde{\boldsymbol{\epsilon}}$ corrects for bias (underestimation of $\eta$ and overestimation of $\sigma$) as an extension of Banerjee et al. 2008, and $\mathcal{C}^T(\theta) \mathcal{C}^{*1}(\theta)$ relates $w$ at desired point locations to the value of the latent GP at the knot locations, where $\mathcal{C}^T(\theta)$ is an $n \times m$ matrix that gets multiplied by the inverse of the $m \times m$ covariance matrix for the latent spatial process at the knots.
For a complete and general description see Finley et al. 2009, but here is the gist of the univariate model:
\boldsymbol{Y} \sim \text{Poisson}(\text{exp}(\boldsymbol{X} \beta + \mathcal{C}^T(\theta) \mathcal{C}^{*-1}(\theta)\boldsymbol{w^*} + \tilde{\boldsymbol{\epsilon}}))
\boldsymbol{w^*} \sim GP(0, \mathcal{C}^*(\theta))
\tilde{\epsilon}(s) \stackrel{indep}{\sim} N(0, C(s, s) - \boldsymbol{c}^T(\theta) \mathcal{C}^{*-1}(\theta) \boldsymbol{c}(\theta))
with priors for $\eta$, $\sigma$, and $\phi$ completing a Bayesian specification.
This approach scales well for larger datasets relative to a full rank GP model.
Comparing the number of effective samples per unit time for the two approaches across a range of sample sizes, the GPP executes more quickly and is more efficient for larger datasets.
Code for this simulation is available on GitHub: https://github.com/mbjoseph/gppspeedtest.
To leave a comment for the author, please follow the link and comment on their blog: Maxwell B. Joseph.
Labeling Opportunities in Price Series
(This article was first published on R – Quintuitive, and kindly contributed to Rbloggers)
One approach to trading which has been puzzling me lately is to sit and wait for opportunities. Sounds simplistic, but it is indeed different from, for instance, the asset allocation strategies. In order to be able to even attempt taking advantage of these opportunities, however, we must be able to identify them. Once the opportunities are identified, we can try to explain (forecast) them using historical data.
The first step is to define what an opportunity is. It could be anything, for instance we can start with a strong directional move, but then we need to define what a strong directional move is. Let’s give it a try.
We can define a strong directional move as, for instance, a 5% move over 3 days. However, that’s not likely to work across different markets (think futures). A 5% move is a lot in treasuries, but not so much in oil. The solution is to normalize the returns using volatility. I will use an exponential moving average for that.
Here is a short code snippet:
adjust.returns = function(returns, vola.len=35) {
   vola = sqrt(EMA(returns * returns, n = vola.len))
   return(returns/vola)
}

good.entries = function(ohlc, days.out=15, vola.len=35, days.pos=0.6) {
   stopifnot(days.out > 3)

   hi = Hi(ohlc)
   lo = Lo(ohlc)
   cl = Cl(ohlc)

   rets = ROC(cl, type='discrete')
   arets = abs(adjust.returns(rets, vola.len=vola.len))

   res = rep(0, NROW(ohlc))
   days = rep(0, NROW(ohlc))

   for(ii in 3:days.out) {
      # look ahead: the highest high and lowest low over the next ii days
      hh = lag.xts(runMax(hi, n=ii), -ii)
      ll = lag.xts(runMin(lo, n=ii), -ii)

      hi.ratio = (hh/cl - 1)/arets
      lo.ratio = (ll/cl - 1)/arets

      # days required
      dd = ceiling(days.pos*ii)

      longs = hi.ratio > dd & -lo.ratio < dd
      longs = ifelse(longs, 1, 0)

      shorts = -lo.ratio > dd & hi.ratio < dd
      shorts = ifelse(shorts, 1, 0)

      both = ifelse(longs == 1, 1, -shorts)

      new.days = ii*as.numeric(res == 0 & both != 0)

      res = ifelse(res != 0, res, both)
      days = ifelse(days != 0, days, new.days)
   }

   return(data.frame(entry=res, days=days))
}

The code basically defines an opportunity as a directional move over X days (X is between 3 and 15). The minimum move we would accept is computed by taking a percentage of the days (60% by default), rounding that number up, and multiplying it by the standard deviation of the normalized returns. We also require that the move in the opposite direction is smaller. Hard to explain, but hopefully you get the idea.
The function returns a label for each day: 0, 1, or -1. The opportunities are, of course, the 1 and -1 days.
Is this tradeable? Absolutely. When an opportunity arises, we enter in the suggested direction. The exit is a profit target; we can use a stop loss of the same magnitude.
Why all this effort again? Simple – these labels can be used as the response, or the dependent variable in an attempt to explain (forecast) the opportunities using historical data. At this point what’s left is to generate what we think might constitute a good set of predictors, align the two sets appropriately (no data snooping) and the rest is standard machine learning.
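As a minimal sketch of that alignment step (the label and return values below are hypothetical, and only base R is used), each predictor must be lagged so that a day's label is explained only by information available before that day:

```r
# Hypothetical labels (as produced by good.entries) and daily returns:
returns <- c(0.010, -0.020, 0.015, 0.003, -0.010)
labels  <- c(0, 0, 1, 0, -1)

# Shift the predictor back one day so the label on day t is paired
# only with the return from day t-1 (no data snooping):
lag1_return <- c(NA, head(returns, -1))

aligned <- data.frame(label = labels, lag1_return = lag1_return)
aligned
```

From here, `aligned` can be fed to any standard supervised learning routine, dropping the first row with the missing predictor.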
That’s for another post though.
The post Labeling Opportunities in Price Series appeared first on Quintuitive.
To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive.
Tuning Apache Spark for faster analysis with Microsoft R Server
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
My colleagues Max Kaznady, Jason Zhang, Arijit Tarafdar and Miguel Fierro recently posted a really useful guide with lots of tips to speed up prototyping models with Microsoft R Server on Apache Spark. These tips apply when using Spark on Azure HDInsight, where you can spin up a Spark cluster in the cloud with Microsoft R installed on the head node and worker nodes with just a single request on the Azure portal:
The blog post provides detailed instructions, but the three main efficiency tips are:
Install RStudio Server on the head node. R Tools for Visual Studio doesn't yet support remote execution, and while you can interact with the R command line directly via SSH or PuTTY, having an IDE makes things much easier and faster. There's a simple guide to install RStudio Server on the head node, which you can then use remotely to drive both the head node and the worker nodes.
Use just the head node for iterating on your model. While you probably want to use all of the data to train your final model, you can speed up the iterative process of developing your model (selecting variables, creating features and transformations, and comparing performance of different model types) by working with a sample of your data on the head node only. While this mightn't have the predictive power of using all the data, it's more than sufficient to determine variable importance and compare performance between competing models. Ultimately, the more iterations you have time for when developing the model, the better the final production model (trained on all the data) will be.
Tune the Spark cluster to optimize performance for model training. Now that you've saved time on the model development, save some time for the full-data model training by tuning the Spark cluster. (This is particularly useful if the production model is going to be retrained on a regular basis, say on a weekly or overnight schedule.) Max and Arijit provide some guidelines for setting the on-heap and off-heap memory settings for each executor to maintain the ideal ratio.
You can read the full details at the Azure blog post linked below, and you can try tuning a sample analysis on the Taxi dataset yourself, using the R code provided on GitHub.
Azure HDInsight blog: Rapid Big Data Prototyping with Microsoft R Server on Apache Spark: Context Switching & Spark Tuning
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Combinations Exercises
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
When doing data analysis, it often happens that we have a set of values and want to obtain the various possible combinations of them. For example, taking 5 random samples from a dataset of 20: how many possible 5-sample sets are there, and how can we obtain all of them? R has a bunch of functions that help with tasks like these: expand.grid, combn, outer and choose.
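As a quick illustration of the four helpers before diving into the exercises (the values here are arbitrary):

```r
# expand.grid: all combinations of several factors
head(expand.grid(coin = c("H", "T"), die = 1:6))  # 12 rows in total

# combn: all k-element subsets, one per column
combn(5, 2)          # the 10 pairs from 1:5

# outer: apply a function (by default "*") to every pair of elements
outer(1:3, 1:3)      # a 3x3 multiplication table

# choose: count the combinations without listing them
choose(20, 5)        # 15504 possible 5-sample sets from 20
```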
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
You first throw a coin with 2 possible outcomes: it either lands heads or tails. Then you throw a die with 6 possible outcomes: 1, 2, 3, 4, 5 and 6.
Generate all possible results of your action.
Exercise 2
Generate a multiplication table for numbers ranging from 1 to 10.
Exercise 3
You have a set of card values:
values <- c("A", 2, 3, 4, 5, 6, 7, 8, 9, 10, "J", "Q", "K")
and a set of suits representing diamonds, clubs, spades and hearts:
suits <- c("d", "c", "s", "h")
Generate a deck of playing cards. (e.g. King of spades should be represented as Ks).
Note: function paste(..., sep="") can be used to combine characters.
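For example, paste() with sep = "" glues its arguments together element-wise, recycling shorter vectors:

```r
paste("K", "s", sep = "")          # "Ks"
paste(c("A", "K"), "s", sep = "")  # "As" "Ks" -- "s" is recycled
```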
Exercise 4
Think about a game of poker using a standard deck of cards like the one generated earlier. A starting hand in poker consists of 5 cards, and their order does not matter. How many different starting hands are there in total?
Exercise 5
You have a set of colors to choose from:
colors <- c("red", "blue", "green", "white", "black", "yellow")
You have to pick 3 colors, and you can't pick the same color more than once. List all possible combinations.
Exercise 6
Using the same set of colors, pick 3 without picking the same color more than once, just like in the previous exercise.
List all possible combinations but this time sort each combination alphabetically.
Exercise 7
You have the same choices of colors and have to pick 3 but this time you can pick the same color more than once.
List all possible combinations.
Exercise 8
You have the same set of colors but this time instead of having to pick 3 you can choose to pick either 1, 2 or 3.
How many different choices can you make?
Exercise 9
You have the same set of colors and you can choose to pick either 1, 2 or 3.
Make a list of all possible choices.
Exercise 10
There are 3 color palettes: the first has 4 colors, the second has 6 colors and the third has 8 colors. You have to pick a palette and then choose up to 5 (1, 2, 3, 4 or 5) colors from the chosen palette. How many different possibilities are there?
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Project package libraries and reproducibility
(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)
Gábor Csárdi, Consultant, Mango Solutions
Introduction
If you are an R user it has probably happened to you that you upgraded some R package in your R installation, and then suddenly your R script or application stopped working.
R packages change over time, and often they are not backwards compatible. This is normal and needed for innovation. But it is also annoying for end users. What can R users do to solve this problem?
Project package libraries
One strategy is to create a new package library for each new project. A package library is just a directory that holds all installed R packages (in addition to the ones that are installed with R itself). There are various ways to configure R to use a separate package library, e.g. by setting the R_LIBS environment variable, or via the .libPaths() function. (Read more about this on the .libPaths() manual page.)
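As a minimal sketch (the directory name is hypothetical; a temporary directory stands in for a real project folder), a per-project library can be set up like this:

```r
# Create a directory to serve as the project's package library:
proj_lib <- file.path(tempdir(), "project_lib")
dir.create(proj_lib, showWarnings = FALSE)

# Prepend it to the library search path; install.packages() and
# library() will now look there first:
.libPaths(c(proj_lib, .libPaths()))
.libPaths()[1]  # the project library comes first
```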
A separate package library for each project ensures that updating packages for one project does not affect the wellbeing of another project, so this is a good first step. It also allows using different versions of the same R packages in different projects.
Once the project is done and works well in this specific environment (or the deadline has passed), you need to make sure that you do not update the R packages in the project library any more, giving your application no chance to break.
The pkgsnap package
It often happens that you need to transfer your application to another computer, e.g. from the development environment to production, or you just want to share it with another developer. This means creating a project package library in the new environment, and then installing the same versions of the same packages from CRAN.
This is surprisingly challenging using the tools in base R. R’s package management tools are very robust, and great at installing the latest versions of the packages reliably. But installing certain package versions, and certain versions of their dependencies is not something they can do, at least not directly.
This is why we created the pkgsnap tool. This is a very simple package with two exported functions:
- snap takes a snapshot of your project library. It writes out the names and versions of the currently installed packages into a text file. You can put this text file into the version control repository of the project, to make sure it is not lost.
- restore uses the snapshot file to recreate the project package library from scratch. It installs the recorded versions of the recorded packages, in the right order.
Here is a little demonstration of how pkgsnap works. First we create a project package library, and install pkgsnap there.
lib_dir <- tempfile()
dir.create(lib_dir)
devtools::install_github("mangothecat/pkgsnap", lib = lib_dir)
We instruct R to use this library directory. This is ideally done from a local .Rprofile file in your project directory. We also load pkgsnap.
.libPaths(lib_dir)
library(pkgsnap)
Currently the new package library is empty; we will install some R packages from CRAN into it. The packages they depend on will also be installed.
install.packages(c("testthat", "pkgconfig"))
installed.packages(lib_dir)[, c("Package", "Version")]
#> Package Version
#> crayon "crayon" "1.3.1"
#> digest "digest" "0.6.8"
#> memoise "memoise" "0.2.1"
#> pkgconfig "pkgconfig" "2.0.0"
#> pkgsnap "pkgsnap" "1.0.0"
#> praise "praise" "1.0.0"
#> testthat "testthat" "0.11.0"
To create a snapshot (to a temporary file this time), you can just call snap with the name of the snapshot file.
snapshot <- tempfile()
snap(to = snapshot)
The packages and their versions are now recorded in the snapshot file. To demonstrate restoration, we scrap our package library and create a new one:
unlink(lib_dir, recursive = TRUE)
new_lib_dir <- tempfile()
dir.create(new_lib_dir)
.libPaths(new_lib_dir)
We are ready to run restore:
restore(snapshot)
#> Downloading
#> crayon_1.3.1.tgz... done.
#> digest_0.6.8.tgz... done.
#> memoise_0.2.1.tgz... done.
#> pkgconfig_2.0.0.tgz... done.
#> pkgsnap_1.0.0.tgz... pkgsnap_1.0.0.tar.gz... pkgsnap_1.0.0.tar.gz... ERROR.
#> praise_1.0.0.tgz... done.
#> testthat_0.11.0.tgz... done.
#> Installing
#> digest_0.6.8.tgz ... done.
#> memoise_0.2.1.tgz ... done.
#> crayon_1.3.1.tgz ... done.
#> pkgconfig_2.0.0.tgz ... done.
#> praise_1.0.0.tgz ... done.
#> testthat_0.11.0.tgz ... done.
Note that we cannot restore the pkgsnap package, because it is not on CRAN (yet). The rest of the packages were downloaded and installed:
installed.packages(new_lib_dir)[, c("Package", "Version")]
#> Package Version
#> crayon "crayon" "1.3.1"
#> digest "digest" "0.6.8"
#> memoise "memoise" "0.2.1"
#> pkgconfig "pkgconfig" "2.0.0"
#> praise "praise" "1.0.0"
#> testthat "testthat" "0.11.0"
packrat is a package dependency management system by RStudio. It is (a lot) more sophisticated and has more features than the simple approach shown here. See more at https://rstudio.github.io/packrat
MRAN is a project by Revolution Analytics, now Microsoft. It allows you to go back in time and install CRAN packages as they were current on a given date. See more at https://mran.revolutionanalytics.com/
Summary
pkgsnap is an R package that helps you recreate package libraries. pkgsnap has no compiled code and no dependencies, so you can install it easily. It is currently available from GitHub: https://github.com/mangothecat/pkgsnap. You can install it via the devtools package:
devtools::install_github("mangothecat/pkgsnap")
Give it a try if you like the idea and tell us what you think.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.