R bloggers

R news and tutorials contributed by 573 R bloggers

Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Sat, 2016-06-18 20:00

(This article was first published on http://r-addict.com, and kindly contributed to R-bloggers)

Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. It is still an active field of research in machine learning. In this post I compare a few feature selection algorithms: a traditional GLM with regularization, the computationally demanding Boruta, and the entropy-based filter from the FSelectorRcpp (free of Java/Weka) package. Check out the comparison, presented as a Venn diagram, carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here mainly to fill out the Venn diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and contain gene expression (RNASeq) measurements from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let’s use the data for Breast Cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample of the collected readings came from tumor or from normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
  substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is derived from the TCGA barcode, whose 14th character tells us whether a sample of the collected readings came from tumor or from normal, healthy tissue.
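A quick way to inspect the resulting class balance (a small check of my own, assuming the data were loaded as above; under the TCGA barcode convention a leading "0" in the sample-type field marks tumor and "1" marks normal tissue):

# class balance of the new dependent variable
table(BRCA.rnaseq$bcr_patient_barcode)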

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a model from the generalized linear models (GLM) family and a natural first attempt at class prediction, can be extended with regularization (an elastic net) to provide prediction and variable selection at the same time. We can assume that features of little value will end up with coefficients equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features, with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores = 6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract names of features that have a non-zero coefficient;
# the first name is the intercept, hence [-1]
names(which(coef(cvfit, s = "lambda.min")[, 1] != 0))[-1] -> glmnet.features

The coef function extracts coefficients for the fitted model. The argument s specifies the regularization parameter for which we would like to extract them – lambda.min is the parameter for which the misclassification error is minimal. You may also try lambda.1se.
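For comparison, the analogous extraction at the more conservative lambda.1se (not run in the original post, shown only as a sketch) would be:

# features with non-zero coefficients at lambda.1se,
# which usually selects fewer features than lambda.min
names(which(coef(cvfit, s = "lambda.1se")[, 1] != 0))[-1] -> glmnet.features.1se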

plot(cvfit)

A discussion about standardization for the LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and still an active field of research.

Boruta

The second feature selection algorithm, Boruta, is based on ranger – a fast implementation of random forests. At the start, the algorithm creates a duplicated ("shadow") copy of every attribute in the model’s formula, with the duplicates’ values randomly shuffled. It then looks at the Z-statistic of each variable across the trees and checks whether the boxplot of Z-statistics for a real variable lies above the highest boxplot among the shuffled duplicates. Boruta is more sophisticated than this sketch, and its 9-page explanation can be found in the Journal of Statistical Software. Below I present code that extracts valuable and tentative attributes.

library(Boruta)
# fit the feature selection algorithm
system.time({
  Boruta(as.factor(bcr_patient_barcode) ~ ., data = BRCA.rnaseq) -> Boruta.model
})
# 20425 seconds ~ 35.5 min
# archive the result
library(archivist)
saveToRepo(Boruta.model, repoDir = "Museum")
# correct names
gsub(x = getSelectedAttributes(Boruta.model),
     pattern = "`", replacement = "", fixed = TRUE) -> Boruta.features

There is an overloaded plot() function for objects of class Boruta that draws boxplots of the Z-statistics for each variable. It’s a good way to visualize the selection process. In this case there are over 20 thousand attributes, so I use the getSelectedAttributes function to extract the valuable features instead. You can also include the tentative variables with withTentative = TRUE.
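For example, a call that keeps the tentative attributes as well (a sketch mirroring the extraction above) could look like:

# valuable features plus those Boruta left as tentative
gsub(x = getSelectedAttributes(Boruta.model, withTentative = TRUE),
     pattern = "`", replacement = "", fixed = TRUE) -> Boruta.features.tentative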

If you run the Boruta code yourself and it takes too long, or you get an error such as

Error: protect(): protection stack overflow

you can instead load my archived model from GitHub into the R global environment with the archivist package:

library(archivist) aread("MarcinKosinski/Museum/d3732742b989ed49666b0472ba52d705") -> Boruta.model

The history of Boruta’s decisions to reject or accept features can be visualized with the following call:

plotImpHistory(Boruta.model)

FSelectorRcpp::information_gain

We have seen that there are many sophisticated and computationally demanding feature selection algorithms. In many cases time is a huge blocker and we can’t wait for Boruta to finish its computations. For such cases simpler algorithms are needed. One of them is the entropy-based filter implemented in the FSelector package and its faster Rcpp (free of Java/Weka) reimplementation in the FSelectorRcpp package (by zzawadz), which also supports sparse matrices. If you would like to contribute to this package, please check our issues list on GitHub.

Information gain is a simple, linear-time algorithm that computes the entropy of the dependent and explanatory variables, and the conditional entropy of the dependent variable with respect to each explanatory variable separately. This statistic tells us how much knowing the distribution of an explanatory variable reduces our uncertainty about the distribution of the dependent variable. The code below shows how easy it is to use.

devtools::install_github('mi2-warsaw/FSelectorRcpp')
library(FSelectorRcpp)
information_gain(y = BRCA.rnaseq[, 1],
                 x = BRCA.rnaseq[, -1]) -> FSelectorRcpp.weights
library(FSelector)
cutoff.k.percent(FSelectorRcpp.weights, k = 0.01) -> FSelectorRcpp.features

The information gain algorithm returns a score for each attribute, and we keep the top 1% of features. For that we use the simple cutoff.k.percent function from FSelector.
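For intuition only, here is a small hand-rolled version of that entropy computation on the built-in iris data (a sketch; FSelectorRcpp discretizes continuous features internally and is not reimplemented here):

# information gain for one already-binned variable: H(Y) - H(Y|X)
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
info_gain_one <- function(y, x_binned) {
  h_y  <- entropy(prop.table(table(y)))
  h_yx <- sum(prop.table(table(x_binned)) *
              sapply(split(y, x_binned),
                     function(yy) entropy(prop.table(table(yy)))))
  h_y - h_yx
}
info_gain_one(iris$Species, cut(iris$Petal.Length, 4))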

Venn Diagram

Finally, we can visualize the differences in the algorithms’ decisions. One way is a Venn diagram, which (quoting Wikipedia)

shows all possible logical relations between a finite collection of different sets.

named.list.for.venn <- list(` Boruta` = Boruta.features,
                            `GLMnet ` = glmnet.features,
                            FSelectorRcpp = FSelectorRcpp.features)
library(VennDiagram)
venn.diagram(
  x = named.list.for.venn,
  filename = "venn.png",
  imagetype = "png",
  # col = "transparent",
  col = "black",
  lty = "dotted",
  lwd = 4,
  fill = c("cornflowerblue", "green", "yellow"),
  alpha = 0.50,
  label.col = c("orange", "white", "orange", "white",
                "darkblue", "white", "darkgreen"),
  cex = 2.5,
  fontfamily = "serif",
  fontface = "bold",
  cat.col = c("darkblue", "darkgreen", "orange"),
  cat.cex = 2.5,
  cat.fontfamily = "serif"
)

The result of this code is the graph at the top of this post. We can see that only one variable was chosen by all 3 algorithms. The VennDiagram package also lets us find out the names of the intersections.

get.venn.partitions(named.list.for.venn) -> partitions
apply(partitions[, c(1, 2, 3)], MARGIN = 1,
      function(configuration) {
        all(configuration == rep(TRUE, 3))
      }) -> all.algorithms
partitions[all.algorithms, ]

  Boruta GLMnet FSelectorRcpp                       ..set..   ..values.. ..count..
1   TRUE   TRUE          TRUE Boruta∩GLMnet ∩FSelectorRcpp ABCA10|10349         1

Do you have any comments about the diagram, or about the unbelievable result that only 1 variable is shared by all 3 algorithms? Maybe some of the algorithms can be improved? Is this a consequence of the random nature of cv.glmnet or Boruta? Feel free to leave your comments on the Disqus panel below.


The ffanalytics R Package for Fantasy Football Data Analysis

Sat, 2016-06-18 14:45

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Introduction

We are continuously looking to provide users with ways to replicate our analyses and improve their performance in fantasy football. To that end, we are introducing the ffanalytics R package, which includes a streamlined version of the scripts we use to scrape projections from multiple sources and calculate projected points using the wisdom of the crowd.

What Software You Need
  • R (for running the scripts)
  • RStudio (best text editor for viewing, editing, and running R scripts)
Installation

The package is available through our GitHub repository, and installation requires that you have already installed the devtools package. You can install the devtools package in R/RStudio via:

install.packages("devtools")

You can then install the ffanalytics R package via:

devtools::install_github(repo = "dadrivr/FantasyFootballAnalyticsR",  subdir = "R Package/ffanalytics")

There is also a tarball available in the GitHub repository that you can download and install with install.packages().
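For example, a standard source install would look something like this (the file name here is hypothetical; use the version you actually downloaded):

install.packages("ffanalytics_1.0.0.tar.gz", repos = NULL, type = "source")

After the package is installed you can then load the package via: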

library("ffanalytics")

Package manual

The package includes a reference manual in standard package format. You can also find this in the help pane in RStudio. The full PDF version of the manual is here.

Addins

The ffanalytics package includes RStudio addins to help you easily scrape and calculate projections. To use the addins, you will need RStudio v0.99.878 or later. The first addin, Run Scrape, will just scrape projections, while Run Projections will scrape projections and calculate projected points. The addins construct the code and return it to the console in RStudio to be executed. Because the Run Projections addin contains the functionality of the Run Scrape addin, we will focus on the projections addin here.

Run Projections addin

The Run Projections addin consists of three tabs that you will need to review before executing the function to scrape and calculate projections.

1. Scrape Settings

When you start the projections addin, you should see the tab where you can select which season, week, sources, and positions you want to scrape. Select the season and week you want to scrape (week = 0 for seasonal projections). Then select the analysts and positions you want to scrape. The function can include subscription-based sources, but you will need to either download subscription projections separately or provide a user name and password for those sites. Note that historical data scrapes are nearly impossible, as sites usually don’t store their historical projections.

 

2. Scoring Settings

Review and set the scoring settings for the calculations on the scoring settings tab. There is a panel for each position with the names of the variables to be scored. These settings are similar to the ones in our Projections tool. Use the scoring settings that match your league’s.

3. Calculation Settings

On the calculations tab, select which average you want to use for aggregating the sources and calculating projected points. Then select which sources of ADP values to use. You can also select which MFL draft types to use when getting the ADP data. Tiers are calculated using effect-size thresholds based on Cohen’s d; smaller d values result in more tiers (fewer players per tier).

Execute R code

You can also execute the code to run directly from R or RStudio. If you want to run a data scrape then you can execute, e.g.:

runScrape(week = 0, season = 2016, analysts = c(-1, 5, 7, 18, 27), positions = c("QB", "RB", "WR", "TE", "K", "DST"))

This scrapes seasonal projections for the 2016 season from 5 different analysts for the non-IDP positions. Refer to the analysts table in the package to identify the analysts you want to use.

If you want to calculate projections from R or RStudio, you will need the output from the runScrape() function. Then you can execute:

getProjections(scrapeData, avgMethod = "average", leagueScoring = userScoring, teams = 12, format = "standard", mflMocks = -1, mflLeagues = -1, adpSources = c("CBS", "ESPN", "FFC", "MFL", "NFL"))

where scrapeData is the output from the runScrape() function and userScoring is your defined scoring rules. See scoringRules in the package for how to define scoring rules.

Troubleshooting

If you run into errors or issues, feel free to let us know in the comments or to open a GitHub issue on our GitHub repo.  If you are able to find and fix an issue, please share the fix with the community (for more info how to share your scripts with the community, see here).

Downloading Our R Scripts

In addition to the ffanalytics R Package, we also have R scripts that accompany posts on the site.  To download and run our R scripts, see here.  If you’d rather calculate projections using our webapps without the R package, see here.  We encourage people to use and improve our scripts, and to share them with the community so everyone benefits.  For info on how to share your scripts with the community, see here.

The post The ffanalytics R Package for Fantasy Football Data Analysis appeared first on Fantasy Football Analytics.


PSA: R’s rnorm() and mvrnorm() use different spreads

Fri, 2016-06-17 21:03

(This article was first published on R – Civil Statistician, and kindly contributed to R-bloggers)

Quick public service announcement for my fellow R nerds:

R has two commonly-used random-Normal generators: rnorm and MASS::mvrnorm. I was foolish and assumed that their parameterizations were equivalent when you’re generating univariate data. But nope:

  • Base R can generate univariate draws with rnorm(n, mean, sd), which uses the standard deviation for the spread.
  • The MASS package has a multivariate equivalent, mvrnorm(n, mu, Sigma), which uses the variance-covariance matrix for the spread. In the univariate case, Sigma is the variance.

I was using mvrnorm to generate a univariate random variable, but giving it the standard deviation instead of the variance. It took me two weeks of debugging to find this problem.
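A quick numerical check (my own sketch, not from the original post) makes the difference concrete:

library(MASS)
set.seed(42)
sd(rnorm(1e5, mean = 0, sd = 2))             # ~2, sd is the standard deviation
sd(mvrnorm(1e5, mu = 0, Sigma = matrix(4)))  # ~2, Sigma is the variance
sd(mvrnorm(1e5, mu = 0, Sigma = matrix(2)))  # ~1.41 = sqrt(2): the bug described above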

Dear reader, I hope this cautionary tale reminds you to check R function arguments carefully!


R packaging industry close-up: How fast are we growing?

Fri, 2016-06-17 20:00

(This article was first published on Daniel's Blog, and kindly contributed to R-bloggers)

I worked a bit over the weekend preparing my talk to be delivered at the seminar organized by IBPAD this week at University of Brasilia, addressing the interfaces of Big Data and Society.
I was invited to present the R package SciencesPo to an eclectic crowd, eclectic in terms of background as well as familiarity with R, so I thought it would be a fair introduction to call the audience’s attention to the R ecosystem, particularly the growing number of specialized packages made available through CRAN.

I gathered some data on package publication dates to produce the following figure. The main plot shows the number of published packages since 2005 (that are still available). Notice that the y-axis is on a log scale. The small-multiples inset shows the count of packages published on CRAN, but only for packages submitted after 2013-01-01, an arbitrary date that makes my job of estimating a growth rate of package submissions a lot easier. The red line represents the modeled growth rate estimated for the period, approximately 5.6% per month.
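For the record, one way to arrive at a monthly growth figure like that is a simple log-linear fit (a sketch of my own; the post itself hard-codes the 1.056 factor in the plot, and new_pkgs is built in the code below):

fit <- lm(log(published) ~ counter, data = new_pkgs)
exp(coef(fit)[["counter"]]) - 1   # estimated monthly growth rate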

Needed packages

library(ggplot2)
library(grid)
library(rvest)
library(dplyr)
library(zoo)
library(SciencesPo)

Data manipulation

url <- "https://cran.r-project.org/web/packages/available_packages_by_date.html"
page <- read_html(url)
page %>%
  html_node("table") %>%
  html_table() %>%
  mutate(count = rev(1:nrow(.))) %>%
  mutate(Date = as.Date(Date)) %>%
  mutate(Month = format(Date, format = "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(published = min(count)) %>%
  mutate(Date = as.Date(as.yearmon(Month))) -> pkgs

The main plot

gg <- ggplot(pkgs, aes(x = Date, y = published))
gg <- gg + geom_line(size = 1.5)
gg <- gg + scale_y_log10(
  breaks = c(0, 10, 100, 1000, 10000),
  labels = c("1", "10", "100", "1.000", "10.000"))
gg <- gg + labs(x = "", y = "# Packages (log)",
                title = "Packages published on CRAN ever since")
gg <- gg + theme_538(base_size = 14, base_family = "Tahoma")
gg <- gg + theme(panel.grid.major.x = element_blank())
gg <- gg + geom_hline(yintercept = 0, size = 1, colour = "#535353")
gg

pkgs %>%
  filter(Date > as.Date("2012-12-31")) %>%
  mutate(publishedGrowth = c(tail(.$published, -1), NA) / published) %>%
  mutate(counter = 1:nrow(.)) -> new_pkgs

The small multiples plot

gg2 <- ggplot(new_pkgs, aes(x = Date, y = published))
gg2 <- gg2 + geom_line(size = 1)
gg2 <- gg2 + geom_line(data = new_pkgs, aes(y = (min(published) * 1.056 ^ counter)),
                       color = 'red', size = .7, linetype = 1)
gg2 <- gg2 + annotate("segment", x = as.Date("2014-08-01"), xend = as.Date("2014-11-01"),
                      y = 500, yend = 500, colour = "red", size = 1)
gg2 <- gg2 + annotate("text", x = as.Date("2015-10-01"), y = 500,
                      label = "5.6% growth estimation", size = 3.5)
gg2 <- gg2 + scale_y_continuous()
gg2 <- gg2 + labs(y = "# Packages", x = "",
                  title = "Packages published on CRAN since 2013")
gg2 <- gg2 + theme_538(legend = "top", base_size = 11, base_family = "Tahoma",
                       colors = c("gray97", "#D2D2D2", "#2b2b2b", "#2b2b2b"))
gg2 <- gg2 + theme(panel.grid.major.x = element_blank())
gg2 <- gg2 + geom_hline(yintercept = 0, size = .6, colour = "#535353")
gg2

gg
print(gg2, vp = viewport(.775, .31, .43, .43))


Can’t compute the standard deviation in your head? Divide the range by four.

Fri, 2016-06-17 13:31

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

TESTING A HEURISTIC TO ESTIMATE STANDARD DEVIATION


[Figure: random draws from the four distributions used in the simulation]

Say you’ve got 30 numbers and a strong urge to estimate their standard deviation. But you’ve left your computer at home. Unless you’re really good at mentally squaring and summing, it’s pretty hard to compute a standard deviation in your head. But there’s a heuristic you can use:

Subtract the smallest number from the largest number and divide by four

Let’s call it the “range over four” heuristic. You could, and probably should, be skeptical. You could want to see how accurate the heuristic is. And you could want to see how the heuristic’s accuracy depends on the distribution of numbers you are dealing with.

Fine.

We generated random numbers from four distributions, pictured above. We nickname them (from top to bottom): floor, left of center, normalish, and uniform. They’re all beta distributions. If you want more detail, they’re the same beta distributions studied in Goldstein and Rothschild (2014). See the code below for parameters.

We vary two things in our simulation:
1) The number of observations on which we’re estimating the standard deviation.
2) The distributions from which the observations are drawn

With each sample, we compute the actual standard deviation and compare it to the heuristic’s estimate of the standard deviation. We do this many times and take the average. Because we like the way mape sounds, we used mean absolute percent error (MAPE) as our error metric. Enough messing around. Let’s show the result.


[Figure: mean absolute percent error of the heuristic by number of observations, for each distribution]

There you have it. With about 30 to 40 observations, we could get an average absolute error of less than 10 percent for three of our distributions, even the skewed ones. With more observations, the error grew for those distributions.

With the uniform distribution, error was over 15 percent in the 30-40 observation range. We’re fine with that. We don’t tend to measure too many things that are uniformly distributed.

Another thing that set the uniform distribution apart is that its error continued to go down as more observations were added. Why is this? The standard deviation of a uniform distribution between 0 and 1 is 1/sqrt(12), or 0.289. The heuristic, if it were lucky enough to draw a 1 and a 0 as the extremes of its sample range, would estimate the standard deviation as 1/4, or 0.25. So, as the sample size increases, the error for the uniform distribution should drop down to a MAPE of 13.4% and flatten out. The graph shows it is well on its way towards doing so.
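The arithmetic behind that 13.4% figure is easy to check in R:

1/sqrt(12)                             # true SD of Uniform(0,1): ~0.289
abs(0.25 - 1/sqrt(12)) / (1/sqrt(12))  # best-case heuristic error: ~0.134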

Want to play with it yourself? R Code below. Thanks to Hadley Wickham for creating tools like dplyr and ggplot2 which take R to the next level.

The post Can’t compute the standard deviation in your head? Divide the range by four. appeared first on Decision Science News.


DataCamp course: Importing and managing financial data

Fri, 2016-06-17 12:32

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

The team at DataCamp announced a new R/Finance course series in a recent email:

Subject: Data Mining Tutorial, R/Finance course series, and more!

R/Finance – A new course series in the works
We are working on a whole new course series on applied finance using R. This new series will cover topics such as time series (David S. Matteson), portfolio analysis (Kris Boudt), the xts and zoo packages (Jeffrey Ryan), and much more. Start our first course Intro to Credit Risk Modeling in R today.

I’m excited to announce that I’m working on a course for this new series! It will provide an introduction to importing and managing financial data.

If you’ve ever done anything with financial or economic time series, you know the data come in various shapes, sizes, and periodicities. Getting the data into R can be stressful and time-consuming, especially when you need to merge data from several different sources into one data set. This course will cover importing data from local files as well as from internet sources.

The tentative course outline is below, followed by a small illustrative sketch. I’d really appreciate your feedback on what should be included in this introductory course! So let me know if I’ve omitted something, or if you think any of the topics are too advanced.

Introduction to importing and managing financial data

  1. Introduction and downloading data
    1. getSymbols design overview, Quandl
    2. Finding and downloading data from internet sources
      1. E.g. getSymbols.yahoo, getSymbols.FRED, Quandl
    3. Loading and transforming multiple instruments
    4. Checking for errors (i.e. summary stats, visualizing)
  2. Managing data from multiple sources
    1. Setting per-instrument sources and default arguments
      1. setSymbolLookup, saveSymbolLookup, loadSymbolLookup, setDefaults
    2. Handling instruments names that clash or are not valid R object names
  3. Aligning data with different periodicities
    1. Making irregular data regular
    2. Aggregating to lowest frequency
    3. Combining monthly with daily
    4. Combining daily with intraday
  4. Storing and updating data
    1. Creating an initial RData-backed storage
    2. Adjusting financial time-series
    3. Handling errors during update process
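To give a flavor of the functions named in the outline, here is a tiny, illustrative quantmod sketch (the symbols and sources are arbitrary examples of mine, not course material):

library(quantmod)
# register a default source per instrument, then fetch them in one call
setSymbolLookup(SPY = list(src = "yahoo"), DGS10 = list(src = "FRED"))
getSymbols(c("SPY", "DGS10"))
# combine a daily price series with an economic series of a different shape
head(merge(Ad(SPY), DGS10))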


Data Journalism Awards Data Visualization of the Year, 2016

Fri, 2016-06-17 11:50

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Congratulations to Peter Aldhous and Charles Seife of BuzzFeed News, winners of the 2016 Data Journalism Award for Data Visualization of the Year. They were recognized for their reporting in Spies in the Skies, which analyzed FAA air traffic records to visualize the domestic surveillance activities of the US government.

Aldhous and Seife used the R language to create the visualizations, so this award is a testament both to their top-notch reporting skills and to the flexibility of the R language to process an unusual and complex data source, and then render the results into a compelling visual story. Kudos.

Simon Rogers reviewed the DJA award winners and also offered his commendations:

Buzzfeed’s Peter Aldhous and Charles Seife won the prize for dataviz of the year for their project on light aircraft/helicopter surveillance over America, Spies in the Skies. The judges said it was a “fantastic example of the layers and depth that only visualization can bring to a story”. It works because it’s more than just beautiful visuals — it adds another dimension to the story and makes it stronger. It’s also not a standalone visual, but part of a much bigger piece of work reporting the issue in forensic detail.

To see the complete list of awardees, follow the link below.

GEN: Data Journalism Awards 2016 (via Alberto Cairo)

 


Introducing xda: R package for exploratory data analysis

Fri, 2016-06-17 03:35

(This article was first published on R Language – the data science blog, and kindly contributed to R-bloggers)

This R package contains several tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data, as well as for performing different kinds of analyses such as univariate, bivariate and multivariate investigation, which is the first step of any predictive modeling pipeline. The package can be used to get a good sense of a dataset before jumping into building predictive models. You can install it from GitHub.

The functions currently included in the package are mentioned below:

  • numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
  • charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
  • Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
  • removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
  • bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata
Installation

To install the xda package, devtools package needs to be installed first. To install devtools, please follow instructions here.

Then, use the following commands to install xda:

library(devtools)
install_github("ujjwalkarn/xda")

Usage

For all examples below, the popular iris dataset has been used. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).

library(xda)

## to view a comprehensive summary for all numeric columns in the iris dataset
numSummary(iris)

## n     = total number of rows for that variable
## miss  = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## 5%    = 5th percentile value of that variable (value below which
##         5 percent of the observations may be found)
## the percentile values are helpful in detecting outliers

## to view a comprehensive summary for all character columns in the iris dataset
charSummary(iris)

## n      = total number of rows for that variable
## miss   = number of rows with missing value
## miss%  = percentage of total rows with missing values ((miss/n)*100)
## unique = number of unique levels of that variable
## note that there is only one character column (Species) in the iris dataset

## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset
bivariate(iris, 'Species', 'Sepal.Length')

## bin_Sepal.Length = 'Sepal.Length' has been binned into 4 equal intervals
##                    (the original range is [4.3, 7.9])
## for each interval of 'Sepal.Length', the number of samples from each
## category of 'Species' is shown, i.e. 39 of the 50 Setosa samples have
## Sepal.Length in the range (4.3, 5.2], and so on
## the number of intervals (4 in this case) can be customized (see documentation)

## to plot all other variables against the 'Petal.Length' variable in the iris dataset
Plot(iris, 'Petal.Length')
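The one function from the list above that is not demonstrated here is removeSpecial; based on the signature given earlier, a hypothetical call (the data frame and character vector are placeholders) would be:

## replace the listed special characters with NA throughout mydata
cleaned <- removeSpecial(mydata, c("?", "#"))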

The package is constantly under development and more functionalities will be added soon. Will also add this to CRAN in the coming days. Pull requests to add more functions are welcome!


Summary Statistics With Aggregate()

Thu, 2016-06-16 13:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The aggregate() function splits dataframes and time series data into subsets, then computes summary statistics for each. The basic structure of the aggregate() function is aggregate(x, by, FUN).
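As a quick illustration of that interface on a built-in dataset (not one of the exercises below):

## mean mpg by number of cylinders in mtcars
aggregate(mtcars$mpg, by = list(cyl = mtcars$cyl), FUN = mean)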

Answers to the exercises are available here.

Exercise 1
Aggregate the “airquality” data by “airquality$Month“, returning means on each of the numeric variables. Also, remove “NA” values.

Exercise 2
Aggregate the “airquality” data by the variable “Day“, remove “NA” values, and return means on each of the numeric variables.

Exercise 3
Aggregate “airquality$Solar.R” by “Month“, returning means of “Solar.R“. The header of column 1 should be “Month“. Remove “not available” values.

Exercise 4
Apply the standard deviation function to the data aggregation from Exercise 3.

Exercise 5
The structure of the aggregate() formula interface is aggregate(formula, data, FUN).

The structure of the formula is y ~ x. The “y” variables are numeric data. The “x” variables, usually factors, are grouping variables that subset the “y” variables.

aggregate.formula allows for one-to-one, one-to-many, many-to-one, and many-to-many aggregation.

Therefore, use aggregate.formula for a one-to-one aggregation of “airquality” by the mean of “Ozone” to the grouping variable “Day“.

Exercise 6
Use aggregate.formula for a many-to-one aggregation of “airquality” by the mean of “Solar.R” and “Ozone” by grouping variable, “Month“.

Exercise 7
Dot notation can replace the “y” or “x” variables in aggregate.formula. Therefore, use “.” dot notation to find the means of the numeric variables in airquality“, with the grouping variable of “Month“.

Exercise 8
Use dot notation to find the means of the “airquality” variables, with the grouping variables of “Day” and “Month“. Display only the first 6 resulting observations.

Exercise 9
Use dot notation to find the means of “Temp“, with the remaining “airquality” variables as grouping variables.

Exercise 10
aggregate.ts is the time series method for aggregate().

Using R‘s built-in time series dataset, “AirPassengers“, compute the average annual standard deviation.

Image by Averater (Own work) [CC BY-SA 3.0], via Wikimedia Commons.


Your data vis “Spidey-sense” & the need for a robust “utility belt”

Thu, 2016-06-16 10:01

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

@theboysmithy did a great piece on coming up with an alternate view for a timeline for an FT piece.

Here’s an excerpt (read the whole piece, though, it’s worth it):

Here is an example from a story recently featured in the FT: emerging-market populations are expected to age more rapidly than those in developed countries. The figures alone are compelling: France is expected to take 157 years (from 1865 to 2022) to triple the proportion of its population aged over 65, from 7 per cent to 21 per cent; for China, the equivalent period is likely to be just 34 years (from 2001 to 2035).

You may think that visualising this story is as simple as creating a bar chart of the durations ordered by length. In fact, we came across just such a chart from a research agency.

But, to me, this approach generates “the feeling” — and further scrutiny reveals specific problems. A reader must work hard to memorise the date information next to the country labels to work out if there is a relationship between the start date and the length of time taken for the population to age. The chart is clearly not ideal, but how do we improve it?

Alan went on to talk about the process of improving the vis, eventually turning to Joseph Priestly for inspiration. Here’s their makeover:

Alan used D3 to make this, which had me head scratching for a bit. Bostock is genius & I :heart: D3 immensely, but I never really thought of it as a “canvas” for doing general data visualization creation for something like a print publication (it’s geared towards making incredibly data-rich interactive visualizations). It’s 100% cool to do so, though. It has fine-grained control over every aspect of a visualization and you can easily turn SVGs into PDFs or use them in programs like Illustrator to make the final enhancements. However, D3 is not the only tool that can make a chart like this.

I made the following in R (of course):

The annotations in Alan’s image were (99% most likely) made with something like Illustrator. I stopped short of fully reproducing the image (life is super-crazy, still), but could have done so (the entire image is one ggplot2 object).

This isn’t an “R > D3” post, though, since I use both. It’s about (a) reinforcing Alan’s points that we should absolutely take inspiration from historical vis pioneers (so read more!) and that we need a diverse visualization “utility belt” (ref: Batman) to ensure we have the necessary tools to make a given visualization; (b) trusting your “Spidey-sense” when it comes to evaluating your creations/decisions; and (c) showing that R is a great alternative to D3 for something like this.

Spider-man (you expected headier references from a dude with a shield avatar?) has this ability to sense danger right before it happens and if you’re making an effort to develop and share great visualizations, you definitely have this same sense in your DNA (though I would not recommend tossing pie charts at super-villains to stop them). When you’ve made something and it just doesn’t “feel right”, look to other sources of inspiration or reach out to your colleagues or the community for ideas or guidance. You can and do make awesome things, and you do have a “Spidey-sense”. You just need to listen to it more, add depth and breadth to your “utility belt” and keep improving with each creation you release into the wild.

R code for the ggplot vis reproduction is below, and it + the CSV file referenced are in this gist.

library(ggplot2)
library(dplyr)

ft <- read.csv("ftpop.csv", stringsAsFactors=FALSE)

arrange(ft, start_year) %>%
  mutate(country=factor(country, levels=c(" ", rev(country), " "))) -> ft

ft_labs <- data_frame(
  x=c(1900, 1950, 2000, 2050, 1900, 1950, 2000, 2050),
  y=c(rep(" ", 4), rep(" ", 4)),
  hj=c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5),
  vj=c(1, 1, 1, 1, 0, 0, 0, 0)
)

ft_lines <- data_frame(x=c(1900, 1950, 2000, 2050))
ft_ticks <- data_frame(x=seq(1860, 2050, 10))

gg <- ggplot()

# tick marks & gridlines
gg <- gg + geom_segment(data=ft_lines, aes(x=x, xend=x, y=2, yend=16),
                        linetype="dotted", size=0.15)
gg <- gg + geom_segment(data=ft_ticks, aes(x=x, xend=x, y=16.9, yend=16.6),
                        linetype="dotted", size=0.15)
gg <- gg + geom_segment(data=ft_ticks, aes(x=x, xend=x, y=1.1, yend=1.4),
                        linetype="dotted", size=0.15)

# double & triple bars
gg <- gg + geom_segment(data=ft, size=5, color="#b0657b",
                        aes(x=start_year, xend=start_year+double, y=country, yend=country))
gg <- gg + geom_segment(data=ft, size=5, color="#eb9c9d",
                        aes(x=start_year+double, xend=start_year+double+triple, y=country, yend=country))

# tick labels
gg <- gg + geom_text(data=ft_labs, aes(x, y, label=x, hjust=hj, vjust=vj), size=3)

# annotations
gg <- gg + geom_label(data=data.frame(), hjust=0, label.size=0, size=3,
                      aes(x=1911, y=7.5, label="France is set to take\n157 years to triple the\nproportion of its\npopulation aged 65+,\nChina only 34 years"))
gg <- gg + geom_curve(data=data.frame(), aes(x=1911, xend=1865, y=9, yend=15.5),
                      curvature=-0.5, arrow=arrow(length=unit(0.03, "npc")))
gg <- gg + geom_curve(data=data.frame(), aes(x=1915, xend=2000, y=5.65, yend=5),
                      curvature=0.25, arrow=arrow(length=unit(0.03, "npc")))

# pretty standard stuff here
gg <- gg + scale_x_continuous(expand=c(0,0), limits=c(1860, 2060))
gg <- gg + scale_y_discrete(drop=FALSE)
gg <- gg + labs(x=NULL, y=NULL,
                title="Emerging markets are ageing at a rapid rate",
                subtitle="Time taken for population aged 65 and over to double and triple in proportion (from 7% of total population)",
                caption="Source: http://on.ft.com/1Ys1W2H")
gg <- gg + theme_minimal()
gg <- gg + theme(axis.text.x=element_blank())
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(plot.margin=margin(10,10,10,10))
gg <- gg + theme(plot.title=element_text(face="bold"))
gg <- gg + theme(plot.subtitle=element_text(size=9.5, margin=margin(b=10)))
gg <- gg + theme(plot.caption=element_text(size=7, margin=margin(t=-10)))
gg


Mapping US Counties in R with FIPS

Thu, 2016-06-16 09:28

(This article was first published on R Tricks – Data Science Riot!, and kindly contributed to R-bloggers)

Anyone who’s spent any time around data knows primary keys are your friend. Enter the FIPS code. FIPS is the Federal Information Processing Standard and appears in most data sets published by the US government.

Name Matching
The map below is an example of the “wrong way” to do something like this. This map uses a string matching technique to match US county names with the county names in the maps package. The map below can be replicated with this GitHub gist, but I don’t recommend it. I’m using an air quality data set from the Centers for Disease Control and Prevention.

Problems with string matching:
  • County names change
  • Louisiana often has “parishes” or “bayous”
  • Alaska often has “territories” or “census areas”

As you can see from above, many counties are missing. It’s possible to fix this with some fancy regex work, but it may take quite some time before you realize why Oglala Lakota County is missing from your base map!

FIPS Matching
The maps package contains a built-in data set that you can call with `county.fips`. The only problem with this is, you still end up string matching with your data set. I’ve found the best way to get a map with baked-in FIPS codes is to download (one of many) shape files provided by the Census Bureau. NOTE: There are also shape files for zip codes, congressional districts, census tracts, etc. The shape file we’re using can be downloaded here. Just unzip and place it in your working dir.

Note that these data don’t contain values for Alaska and Hawaii.
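For reference, the built-in lookup table mentioned above pairs an integer FIPS code with a state,county polygon name, so matching your own data to it still means matching on names:

library(maps)
data(county.fips)
head(county.fips)   # columns: fips, polyname (e.g. "alabama,autauga")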


Results

The data are taken from 2000 to 2010, and the shape file we’re using is from 2013. But since FIPS codes remain constant, even when county names change, everything matches up just fine.

Other Advantages
Leaflet, anyone? Another plus to shape files is that they are easily rendered to a Leaflet map. I threw the below map together “quick and dirty.” I’m not really pleased with the green-to-red color ramp, but I’m sure that could be fixed by manually assigning color buckets. ggplot2 seems to have a better handle on color ramping straight out of the box.


 



The R Packages of UseR! 2016

Thu, 2016-06-16 09:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

It is always a delight to discover a new and useful R package, and it is especially nice when the discovery comes with context and a testimonial to its effectiveness. It is also satisfying to check in once in a while and get an idea of what people think is hot, current or trending in the R world. The schedule for the upcoming useR! conference at Stanford is a touchstone for both of these purposes. It lists 128 contributed talks over 3 days. Vetted for both content and quality by the program committee, these talks represent a snapshot of topics and research areas that are of current interest to R developers and researchers, and they also catalog the packages and tools that support the research. As best as I can determine, the abstracts for the talks explicitly call out 154 unique packages.

Some of the talks are all about introducing new R packages that have been very recently released or are still under development on GitHub. Others mention the packages that make up the current suite of tools for a particular research area. Obviously, the abstracts do not list all of the packages employed by the researchers in their work, but presumably they do mention the packages that the speakers think are most important for describing their work and attracting an audience.

The following table maps packages to talks. There are multiple lines for packages when they map to more than one talk and vice versa. For me, browsing a list like this is akin to perusing a colleague's book shelves; noting old favorites and discovering the occasional new gem.

I am reluctant to draw any conclusions about the importance of the packages to the talks from the abstracts alone, before the talks are given. However, it is interesting to note that all of the packages mentioned more than twice are from the RStudio suite. Moreover, shiny and rmarkdown are apparently still new and shiny: mentioning them in an abstract still conveys some cachet. I suspect that ggplot2 will figure in more analyses than both of these packages put together, but ggplot2 has become so much a part of the fabric of R that there is no premium in promoting it.

If you are looking forward to useR! 2016, whether you attend in person or watch the videos afterwards, gaining some familiarity with the R packages highlighted in the contributed talks is sure to enhance your experience.


A Return.Portfolio Wrapper to Automate Harry Long Seeking Alpha Backtests

Thu, 2016-06-16 03:21

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will cover a function that simplifies creating Harry Long type rebalancing strategies from Seeking Alpha for interested readers. As Harry Long has stated, most, if not all, of his strategies are more for demonstrative purposes than actual recommended investments.

So, since Harry Long has been posting some more articles on Seeking Alpha, I’ve had a reader or two ask me to analyze his strategies (again). Instead of doing that, however, I’ll simply put this tool here: a wrapper that automates the acquisition of data and simulates portfolio rebalancing with one line of code.

Here’s the tool.

require(quantmod)
require(PerformanceAnalytics)
require(downloader)

LongSeeker <- function(symbols, weights, rebalance_on = "years",
                       displayStats = TRUE, outputReturns = FALSE) {

  getSymbols(symbols, src = 'yahoo', from = '1990-01-01')

  prices <- list()
  for(i in 1:length(symbols)) {
    if(symbols[i] == "ZIV") {
      # use the extended ZIV history rather than Yahoo
      download("https://www.dropbox.com/s/jk3ortdyru4sg4n/ZIVlong.TXT", destfile = "ziv.txt")
      ziv <- xts(read.zoo("ziv.txt", header = TRUE, sep = ",", format = "%Y-%m-%d"))
      prices[[i]] <- Cl(ziv)
    } else if (symbols[i] == "VXX") {
      # likewise for VXX
      download("https://dl.dropboxusercontent.com/s/950x55x7jtm9x2q/VXXlong.TXT", destfile = "vxx.txt")
      vxx <- xts(read.zoo("vxx.txt", header = TRUE, sep = ",", format = "%Y-%m-%d"))
      prices[[i]] <- Cl(vxx)
    } else {
      prices[[i]] <- Ad(get(symbols[i]))
    }
  }

  prices <- do.call(cbind, prices)
  prices <- na.locf(prices)
  returns <- na.omit(Return.calculate(prices))

  # pad with a zero-return column so the supplied weights need not sum to one
  returns$zeroes <- 0
  weights <- c(weights, 1 - sum(weights))

  stratReturns <- Return.portfolio(R = returns, weights = weights,
                                   rebalance_on = rebalance_on)

  if(displayStats) {
    stats <- rbind(table.AnnualizedReturns(stratReturns),
                   maxDrawdown(stratReturns),
                   CalmarRatio(stratReturns))
    rownames(stats)[4] <- "Max Drawdown"
    print(stats)
    charts.PerformanceSummary(stratReturns)
  }

  if(outputReturns) {
    return(stratReturns)
  }
}

It fetches the data for you (usually from Yahoo, but a big thank you to Mr. Helmuth Vollmeier in the case of ZIV and VXX), and has the option of either simply displaying an equity curve and some statistics (CAGR, annualized standard deviation, Sharpe ratio, max drawdown, Calmar ratio), or giving you the return stream as an output if you wish to do more analysis in R.

Here’s an example of simply getting the statistics, with an 80% XLP/SPLV (they’re more or less interchangeable) and 20% TMF (aka 60% TLT, so an 80/60 portfolio), from one of Harry Long’s articles.

LongSeeker(c("XLP", "TLT"), c(.8, .6))

Statistics:

                          portfolio.returns
Annualized Return                 0.1321000
Annualized Std Dev                0.1122000
Annualized Sharpe (Rf=0%)         1.1782000
Max Drawdown                      0.2330366
Calmar Ratio                      0.5670285

Equity curve:

Nothing out of the ordinary from what we might expect of a balanced equity/bond portfolio. It generally does well, has its largest drawdown in the financial crisis, and hits some other bumps in the road, but overall I'd call it a fairly vanilla "set it and forget it" sort of thing.

And here would be the way to get the stream of individual daily returns, assuming you wanted to rebalance these two instruments weekly, instead of yearly (as is the default).

tmp <- LongSeeker(c("XLP", "TLT"), c(.8, .6), rebalance_on="weeks", displayStats = FALSE, outputReturns = TRUE)

And now let’s get some statistics.

table.AnnualizedReturns(tmp) maxDrawdown(tmp) CalmarRatio(tmp)

Which give:

> table.AnnualizedReturns(tmp)
                          portfolio.returns
Annualized Return                    0.1328
Annualized Std Dev                   0.1137
Annualized Sharpe (Rf=0%)            1.1681

> maxDrawdown(tmp)
[1] 0.2216417

> CalmarRatio(tmp)
             portfolio.returns
Calmar Ratio         0.5990087

Turns out, moving the rebalancing from annually to weekly didn't have much of an effect here (besides giving a bunch of money to your broker, if you factored in transaction costs, which this doesn't).
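If you wanted a rough sense of that cost drag, one way (not part of the original wrapper) is to run Return.portfolio with verbose = TRUE, estimate turnover from the beginning- and end-of-period weights, and charge a flat percentage on the amount traded. The sketch below assumes you have the asset-level returns and padded weights built inside LongSeeker (or recompute them the same way); the 10 basis point cost rate is a made-up placeholder, not a real broker quote.

# Rough sketch: penalize weekly-rebalanced returns by an assumed flat cost on turnover.
# 'returns' and 'weights' are the objects constructed inside LongSeeker.
port <- Return.portfolio(R = returns, weights = weights,
                         rebalance_on = "weeks", verbose = TRUE)
turnover <- xts(rowSums(abs(port$BOP.Weight - lag(port$EOP.Weight)), na.rm = TRUE),
                order.by = index(port$BOP.Weight))
costRate <- 0.001  # placeholder: 10 bps per unit of turnover
net <- port$returns - costRate * turnover
gross_vs_net <- cbind(port$returns, net)
colnames(gross_vs_net) <- c("gross", "net")
table.AnnualizedReturns(gross_vs_net)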

So, that’s how this tool works. The results, of course, begin from the latest instrument’s inception. The trick, in my opinion, is to try and find proxy substitutes with longer histories for newer ETFs that are simply leveraged ETFs, such as using a 60% weight in TLT with an 80% weight in XLP instead of a 20% weight in TMF with 80% allocation in SPLV.

For instance, here are some proxies:

SPLV = XLP
SPXL/UPRO = SPY * 3
TMF = TLT * 3
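As a quick check on that idea (not from the original article), you can run the wrapper on both versions of the example above and compare them over their common history: the longer-lived XLP/TLT proxy versus the SPLV/TMF leveraged-ETF version.

# Proxy version with longer history vs. the leveraged-ETF version
proxy <- LongSeeker(c("XLP", "TLT"), c(.8, .6),
                    displayStats = FALSE, outputReturns = TRUE)
lev <- LongSeeker(c("SPLV", "TMF"), c(.8, .2),
                  displayStats = FALSE, outputReturns = TRUE)
both <- na.omit(cbind(proxy, lev))
colnames(both) <- c("XLP.TLT.proxy", "SPLV.TMF")
charts.PerformanceSummary(both)
table.AnnualizedReturns(both)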

That said, I’ve worked with Harry Long before, and he develops more sophisticated strategies behind the scenes, so I’d recommend that SeekingAlpha readers take his publicly released strategies as concept demonstrations, as opposed to fully-fledged investment ideas, and contact Mr. Long himself about more customized, private solutions for investment institutions if you are so interested.

Thanks for reading.

NOTE: I am currently in the northeast. While I am currently contracting, I am interested in networking with individuals or firms with regards to potential collaboration opportunities.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

Categories: Methodology Blogs

Visualizing obesity across United States by using data from Wikipedia

Thu, 2016-06-16 03:11

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

In this post I will show how to collect data from a webpage and analyze or visualize it in R. For this task I will use the rvest package and get the data from Wikipedia. I got the idea for this post from Fisseha Berhane.

I will get the prevalence of obesity in the United States from a Wikipedia page and then plot it on a map. Let's begin by loading the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)

After loading the packages in R, I will load the data. As I mentioned before, I will download the data from Wikipedia.

## LOAD THE DATA ####
obesity = read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
obesity = obesity %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(fill = T)

The first line of code reads the page from Wikipedia; the second extracts the table we are interested in and turns it into a data frame in R.
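If you are not sure which table to extract (the .[[1]] above), a quick check that is not in the original post is to list all of the tables on the page and glance at their column names before picking an index:

# Peek at every table on the page to find the one with the obesity figures
page <- read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
tables <- page %>% html_nodes("table")
length(tables)  # how many tables the page contains
lapply(tables, function(tab) names(html_table(tab, fill = TRUE)))  # column names of each table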

Now let's check what our data looks like.

head(obesity)
  State and District of Columbia Obese adults Overweight (incl. obese) adults
1                        Alabama        30.1%                           65.4%
2                         Alaska        27.3%                           64.5%
3                        Arizona        23.3%                           59.5%
4                       Arkansas        28.1%                           64.7%
5                     California        23.1%                           59.4%
6                       Colorado        21.0%                           55.0%
  Obese children and adolescents Obesity rank
1                          16.7%            3
2                          11.1%           14
3                          12.2%           40
4                          16.4%            9
5                          13.2%           41
6                           9.9%           51

The data frame looks good; now we need to clean it to get it ready for plotting.

## CLEAN THE DATA ####
str(obesity)
'data.frame': 51 obs. of 5 variables:
 $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ Obese adults                   : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
 $ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
 $ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
 $ Obesity rank                   : int 3 14 40 9 41 51 49 43 22 39 ...

# remove the % and make the data numeric
for(i in 2:4){
    obesity[,i] = gsub("%", "", obesity[,i])
    obesity[,i] = as.numeric(obesity[,i])
}

# check the data again
str(obesity)
'data.frame': 51 obs. of 5 variables:
 $ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ Obese adults                   : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
 $ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
 $ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
 $ Obesity rank                   : int 3 14 40 9 41 51 49 43 22 39 ...

Now we will fix the variable names by removing the spaces.

names(obesity)
[1] "State and District of Columbia"  "Obese adults"
[3] "Overweight (incl. obese) adults" "Obese children and adolescents"
[5] "Obesity rank"

names(obesity) = make.names(names(obesity))

names(obesity)
[1] "State.and.District.of.Columbia"  "Obese.adults"
[3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
[5] "Obesity.rank"

Our data looks good. It's time to load the map data.

# load the map data
states = map_data("state")
str(states)
'data.frame': 15537 obs. of 6 variables:
 $ long     : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat      : num 30.4 30.4 30.4 30.3 30.3 ...
 $ group    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ order    : int 1 2 3 4 5 6 7 8 9 10 ...
 $ region   : chr "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr NA NA NA NA ...

We will merge the two datasets (obesity and states) by region, so we first need to create a new variable, region, in the obesity dataset.

# create a new variable name for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)

Now we will merge the datasets.

states = merge(states, obesity, by = "region", all.x = T)
str(states)
'data.frame': 15537 obs. of 11 variables:
 $ region                         : chr "alabama" "alabama" "alabama" "alabama" ...
 $ long                           : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat                            : num 30.4 30.4 30.4 30.3 30.3 ...
 $ group                          : num 1 1 1 1 1 1 1 1 1 1 ...
 $ order                          : int 1 2 3 4 5 6 7 8 9 10 ...
 $ subregion                      : chr NA NA NA NA ...
 $ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Obese.adults                   : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
 $ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
 $ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
 $ Obesity.rank                   : int 3 3 3 3 3 3 3 3 3 3 ...

Plot the data

Finally we will plot the prevalence of obesity in adults.

## MAKE THE PLOT ####
# adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                        guide = "colorbar", na.value = "black",
                        breaks = pretty_breaks(n = 5)) +
    labs(title = "Prevalence of Obesity in Adults") +
    coord_map()

Here is the plot for adults:

Similarly, we can plot the prevalence of obesity in children.

# children
ggplot(states, aes(x = long, y = lat, group = group,
                   fill = Obese.children.and.adolescents)) +
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                        guide = "colorbar", na.value = "black",
                        breaks = pretty_breaks(n = 5)) +
    labs(title = "Prevalence of Obesity in Children and Adolescents") +
    coord_map()

Here is the plot for children:

If you would like to show the name of each state on the map, use the code below to create a new dataset with one label position per state.

statenames = states %>%
    group_by(region) %>%
    summarise(
        long = mean(range(long)),
        lat = mean(range(lat)),
        group = mean(group),
        Obese.adults = mean(Obese.adults),
        Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
    )

Then add this layer to the ggplot code above:

geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
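Putting it together, the adult map from above with the state labels added would look roughly like this (the same code as before plus the geom_text layer):

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
    geom_polygon(color = "white") +
    scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                        guide = "colorbar", na.value = "black",
                        breaks = pretty_breaks(n = 5)) +
    geom_text(data = statenames, aes(x = long, y = lat, label = region), size = 3) +
    labs(title = "Prevalence of Obesity in Adults") +
    coord_map()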

That’s all. I hope you learned something useful today.

    Related Post

    1. Plotting App for ggplot2 – Part 2
    2. Mastering R plot – Part 3: Outer margins
    3. Interactive plotting with rbokeh
    4. Mastering R plot – Part 2: Axis
    5. How to create a Twitter Sentiment Analysis using R and Shiny

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    Categories: Methodology Blogs

    Radar charts in R using Plotly

    Thu, 2016-06-16 02:22

    (This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

    This post is inspired by this question on Stack Overflow.

    We’ll show how to create excel style Radar Charts in R using the plotly package.

    library(plotly)
    library(dplyr)

    # Read in data
    df <- read.csv("https://cdn.rawgit.com/plotly/datasets/master/Consumer%20Complaints.csv",
                   stringsAsFactors = F, check.names = F)

    # Melt
    df <- reshape2::melt(df, id = c("Company"))
    colnames(df) <- c("Company", "Complaint", "Percent")

    getPolarCoord <- function(r, matrix = F, na = F){
      # Get starting angle and angle increments
      theta <- 0
      dtheta <- 360 / length(r)
      dtheta <- (pi / 180) * dtheta  # in radians

      # Get polar coordinates
      x <- c()
      y <- c()

      for(i in 1:length(r)){
        x <- c(x, r[i] * cos(theta))
        y <- c(y, r[i] * sin(theta))
        theta <- theta + dtheta
      }

      x[length(x) + 1] <- x[1]
      y[length(y) + 1] <- y[1]

      if(na == T){
        x[length(x) + 1] <- NA
        y[length(y) + 1] <- NA
      }

      if(matrix == T){
        return(cbind(x, y))
      }else{
        return(list(x = x, y = y))
      }
    }

    coords <- by(df, df[,"Complaint"], function(r){
      x <- getPolarCoord(r[,3])
      x <- cbind(x$x, x$y)
      x <- data.frame(rbind(r, r[1,]), x = x[,1], y = x[,2])
      return(x)
    })

    coords <- rbind(coords[[1]], coords[[2]], coords[[3]])

    df <- data.frame(coords,
                     txt = paste(coords$Company, "<br>", coords$Complaint, ":",
                                 round(coords$Percent*100, 2), "%"))

    # Plot
    smooth <- 1
    bgcolor <- "white"

    p <- plot_ly(data = df, x = x, y = y, mode = "lines", group = Complaint,
                 fill = "toself", line = list(smoothing = smooth, shape = "spline"),
                 hoverinfo = "text", text = txt) %>%
      add_trace(data = df, x = x, y = y, mode = "markers",
                marker = list(color = "white", size = 10, line = list(width = 2)),
                hoverinfo = "none", showlegend = F) %>%
      layout(xaxis = list(title = "", showgrid = F, zeroline = F,
                          showticklabels = F, domain = c(0.02, 0.48)),
             yaxis = list(title = "", showgrid = F, zeroline = F,
                          showticklabels = F, domain = c(0, 0.92)),
             font = list(family = "serif", size = 15),
             legend = list(x = 0.55, y = 0.9, bgcolor = "transparent"),
             plot_bgcolor = bgcolor, paper_bgcolor = bgcolor)

    # Add grids
    grid <- rbind(getPolarCoord(rep(0.05, 50), matrix = T, na = T),
                  getPolarCoord(rep(0.10, 80), matrix = T, na = T),
                  getPolarCoord(rep(0.15, 150), matrix = T, na = T),
                  getPolarCoord(rep(0.20, 170), matrix = T, na = T),
                  getPolarCoord(rep(0.25, 200), matrix = T, na = T))

    grid <- as.data.frame(grid)

    p <- add_trace(p, data = grid, x = x, y = y, mode = "lines",
                   line = list(color = "#57788e", dash = "4px", width = 1),
                   showlegend = F, hoverinfo = "none")

    inner <- getPolarCoord(rep(0.06, 5))
    outer <- getPolarCoord(rep(0.27, 5))

    x = t(cbind(inner$x, outer$x))
    y = t(cbind(inner$y, outer$y))

    x <- as.numeric(apply(x, 2, function(vec){ return(c(vec, NA)) }))
    y <- as.numeric(apply(y, 2, function(vec){ return(c(vec, NA)) }))

    linegrid <- data.frame(x = x, y = y)

    p <- add_trace(p, data = linegrid, x = x, y = y, mode = "lines",
                   line = list(color = "#57788e", dash = "4px", width = 1),
                   showlegend = F, hoverinfo = "none")

    # Add text
    banks <- c("Bank of<br>America", "Wells Fargo<br>&Company",
               "JP Morgan<br>Chase & Co.", "CitiBank", "Capital One")

    labels <- paste0("<em>", banks, "</em>")

    p <- add_trace(p, data = getPolarCoord(rep(0.28, 5)), x = x, y = y,
                   mode = "text", text = labels, showlegend = F, hoverinfo = "none",
                   textfont = list(family = "serif", color = "#808080"))

    # Add a gray circle
    p <- add_trace(p, data = getPolarCoord(rep(0.24, 200)), x = x, y = y,
                   fill = "toself", fillcolor = "rgba(200, 200, 200, 0.3)",
                   line = list(color = "transparent"), mode = "lines",
                   hoverinfo = "none", showlegend = F)

    # Add titles, description etc
    p <- layout(p, annotations = list(
      list(xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
           x = 0.03, y = 1, showarrow = F,
           text = "<b>Consumer complaints for five large banks in the U.S.</b>",
           font = list(family = "serif", size = 25, color = "#4080bf")),
      list(xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
           x = 0.03, y = 0.95, showarrow = F,
           text = '<em>Source: Consumer Financial Protection Bureau</em>',
           font = list(family = "serif", size = 16, color = "#679bcb")),
      list(xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
           x = 0.60, y = 0.20, showarrow = F, align = "left",
           text = "Complaints received by the Consumer Financial Protection Bureau<br>regarding financial products and services offered by five large banks in<br>the United States expressed as a percentage of total number<br>of complaints.",
           font = list(family = "arial", size = 12)),
      list(xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
           x = 0.60, y = 0.05, showarrow = F, align = "left",
           text = '<a href = "https://catalog.data.gov/dataset/consumer-complaint-database">Click here to go to source</a>',
           font = list(family = "arial", size = 14))
    ),
    shapes = list(
      list(xref = "paper", yref = "paper", x0 = 0, x1 = 0.95, y0 = 0, y1 = 1,
           type = "rect", layer = "above",
           fillcolor = "rgba(191, 191, 191, 0.1)",
           line = list(color = "transparent"))
    ))

    print(p)

    To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

    Categories: Methodology Blogs

    Euro 2016 Squads

    Wed, 2016-06-15 15:43

    (This article was first published on R – Guy Abel, and kindly contributed to R-bloggers)

    This weekend I was having fun in France, watching some Euro 2016 matches, visiting friends and avoiding Russian hooligans. Before my flight over I scraped some tables on the tournament's Wikipedia page with my newly acquired rvest skills, with the idea of building up a bilateral database of Euro 2016 squads and their players' clubs.

    On the flight I managed to come up with some maps showing these connections. First off, I used ggplot2 to plot lines connecting the location of every player's club team to their national squad's base in France. The paths of the lines were calculated using the gcIntermediate function in the geosphere package. The lines' colours are based on the national teams' jerseys, which I obtained in R using the amazing extract_colours function in the rPlotter package. I was not entirely convinced that this plot was all that effective, as the base camp of each team in France is not particularly common knowledge. This led me to create a new plot with a link from each player's club team to their nation's capital city.
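    As a rough illustration of those two pieces (not the actual code, which is linked at the end of the post), gcIntermediate returns points along the great-circle path between two locations, and extract_colours pulls dominant colours from an image URL; the coordinates and jersey image URL below are placeholders.

    library(geosphere)
    library(ggplot2)
    library(rPlotter)  # available from GitHub: woobe/rPlotter

    club    <- c(-2.2, 53.5)  # placeholder club location (lon, lat)
    capital <- c(-6.3, 53.3)  # placeholder capital city (lon, lat)

    # Points along the great-circle path between the two locations
    path <- as.data.frame(gcIntermediate(club, capital, n = 50, addStartEnd = TRUE))

    # Dominant colours of a (placeholder) jersey image, used to colour the line
    jersey_cols <- extract_colours("http://example.com/jersey.png", num_col = 3)

    ggplot(path, aes(x = lon, y = lat)) +
        geom_path(colour = jersey_cols[1], size = 1)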

    This shows some clear relationships, such as between the Iceland players and clubs in Scandinavia, and between clubs in northern England and Scotland and the Irish teams. Finally, I had a go at matching my old World Cup circular plot using the chordDiagram function in the circlize package.

    I ordered the countries according to their UEFA coefficient, which is based on the performance of club teams in European competitions. Most players playing abroad are based in teams in the top leagues of England, Germany, Italy or Spain. Players based in the English leagues make up most of the squads for teams from the British Isles. There are sizable numbers of the Austrian and Swiss squads playing for clubs in the German leagues and Croatian players in Italy.
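    For readers who have not used circlize, chordDiagram just needs a matrix (or long-format data frame) of counts, and the order argument controls the sector ordering; a toy example with made-up squad-to-club-country counts, nothing like the real data, would be:

    library(circlize)

    # Made-up counts of players per national squad by the country of their club
    m <- matrix(c(7, 5, 0,
                  0, 18, 0,
                  0, 12, 3),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("Iceland", "Ireland", "Wales"),
                                c("Sweden", "England", "Spain")))

    chordDiagram(m, order = c("England", "Spain", "Sweden", "Iceland", "Ireland", "Wales"))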

    All the code for scraping the data and producing the plots is on my GitHub.

     

    To leave a comment for the author, please follow the link and comment on their blog: R – Guy Abel.

    Categories: Methodology Blogs

    Gender ratios of programmers, by language

    Wed, 2016-06-15 11:54

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    While there are many admirable efforts to increase participation by women in STEM fields, in many programming teams men still outnumber women, often by a significant margin. Specifically by how much is a fraught question, and accurate statistics are hard to come by. Another interesting question is whether the gender disparity varies by language, and how to define a "typical programmer" for a given language.

    Jeff Allen from Trestle Tech recently took an interesting approach using R to gather data on gender ratios for programmers: get a list of the top coders for each programming language, and then count the number of men and women in each list. Neither task is trivial. For a list of coders, Jeff scraped GitHub's list of trending repositories over the past month by programming language, and then extracted the avatars of the listed contributors. He then used the Microsoft Cognitive Services Face API on each avatar to determine the apparent gender of each contributor, and tallied up the results. You can find the R code he used on GitHub.

    I used Jeff's code to re-run his results based on GitHub's latest monthly rankings. The first thing I needed to do was to request an API key; a trial key is free with a Microsoft account. (The number of requests per second is limited, but the R code is written to limit the rate of requests accordingly.) I limited my search to the languages C++, C#, Java, Javascript, Python, R and Ruby. The percentage of contributors identified as female within each language is shown below:
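    For anyone curious what a single call to the Face API looks like from R, here is a rough sketch (not Jeff's code; the endpoint, key and avatar URL are placeholders, and the API details may have changed since this was written):

    library(httr)

    face_key   <- "YOUR_API_KEY"        # placeholder trial key
    face_url   <- "https://api.projectoxford.ai/face/v1.0/detect?returnFaceAttributes=gender"
    avatar_url <- "https://avatars.githubusercontent.com/u/000000"  # placeholder avatar

    res <- POST(face_url,
                add_headers(`Ocp-Apim-Subscription-Key` = face_key),
                body = list(url = avatar_url),
                encode = "json")
    face <- content(res)
    if (length(face) > 0) face[[1]]$faceAttributes$gender  # "male" or "female" if a face was found

    Sys.sleep(3)  # throttle to respect the trial key's rate limit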

    According to this analysis, none of the contributors to top C++ projects on GitHub were identified as female; by contrast, almost 10% of contributors to R projects were.

    Now, these data need to be taken with a grain of salt. The main issue is numbers: fewer than 100 programmers per language are identified as "top programmers" via this method, and sometimes significantly fewer (just 45 top C++ contributors were identified). Part of the reason for this is that not all programmers use their face as an avatar; those that used a symbol, logo or cartoon were not counted. Furthermore, it's reasonable to assume that there's a disparity in the rate at which women use their own face as an avatar compared to men, which would add bias to the above results in addition to the variability from the small numbers. Finally, the gender determination is based on an algorithm, and isn't guaranteed to match the gender identity of the programmer (or their avatar).

    Nonetheless, it's an interesting example of using social network data in conjunction with cognitive APIs to conduct demographic studies. You can find examples of using other data from the facial analysis, including apparent happiness by language, at the link below.

    (Update June 15: re-ran the analysis and updated the chart above to actually display percentages, not ratios, on the y-axis. The numbers changed slightly as the GitHub data changed. The old chart is here.)

    Trestle Tech: EigenCoder: Programming Stereotypes

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    Monthly Regional Tourism Estimates

    Wed, 2016-06-15 08:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    A big 18 month project at work culminated today in the release of new Monthly Regional Tourism Estimates for New Zealand. Great work by the team in an area where we’ve pioneered the way, using administrative data from electronic transactions to supplement traditional sources in producing official statistics.

    Here’s a screen shot from one of the pages letting people play with the data:

    The source data include electronic card spend, survey data, and the National Accounts. All the data production is done in R, with iterative proportional fitting, smoothing, time series forecasts, and yet more iterative proportional fitting before we come up with the best available estimates of how much is spent by tourists by product by origin by time period. This project builds on a range of developments we’ve made since 2012 with previous Regional Tourism Indicators and annual Regional Tourism Estimates. The data will be updated every month from now on.
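    For readers unfamiliar with iterative proportional fitting, the core idea is to rescale a seed table until its row and column totals match known margins. A generic base R illustration with made-up numbers (nothing to do with the actual tourism pipeline) looks like this:

    # Iterative proportional fitting (raking) of a seed table to target margins
    seed       <- matrix(c(40, 30, 20,
                           60, 10, 25), nrow = 2, byrow = TRUE)
    row_target <- c(100, 80)     # e.g. known totals by origin
    col_target <- c(90, 50, 40)  # e.g. known totals by product

    ipf <- function(seed, row_target, col_target, tol = 1e-8, max_iter = 1000) {
        fit <- seed
        for (i in 1:max_iter) {
            fit <- fit * (row_target / rowSums(fit))        # scale rows to their targets
            fit <- t(t(fit) * (col_target / colSums(fit)))  # scale columns to their targets
            if (max(abs(rowSums(fit) - row_target)) < tol) break
        }
        fit
    }

    round(ipf(seed, row_target, col_target), 2)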

    A collection of web pages with interactive graphics helps users explore the data, all of which can also be downloaded for re-use.

    The interactive graphics are hand coded in JavaScript, with R used for all the pre-processing (eg reshaping, tidying, calculating moving averages, etc).
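    As a trivial example of that kind of pre-processing, a rolling 12-month average of a monthly series is a one-liner in R (illustrative data only):

    library(zoo)

    set.seed(42)
    spend <- 100 + cumsum(rnorm(36))  # made-up monthly series
    rolling_12m <- rollmean(spend, k = 12, align = "right")
    head(rolling_12m)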

    Great work, team at work!

    To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

    Categories: Methodology Blogs

    Calculate your nutrients with my new package: NutrientData

    Wed, 2016-06-15 03:30

    (This article was first published on Renglish – 56north | Skræddersyet dataanalyse, and kindly contributed to R-bloggers)

    I have created a new package: NutrientData

    This package contains data sets with the composition of Foods: Raw, Processed, Prepared. The source of the data is the USDA National Nutrient Database for Standard Reference, Release 28 (2015), along with two functions to search for ingredients and calculate nutrients.

    You can download it from GitHub:
    devtools::install_github("56north/NutrientData")

    Let's first have a look at the top 20 most calorie-dense foods:

    library(NutrientData)
    library(dplyr)

    data("ABBREV") # Load the data

    ABBREV %>% # Select the data
    arrange(-Energ_Kcal) %>% # Sort by calories per 100 g
    select(Food = Shrt_Desc, Calories = Energ_Kcal) %>% # Select relevant columns
    slice(1:20) # Choose the top 20

    If you want to search for a specific ingredient, use the "search_ingredient" function. Let's search for raw onions:

    search_ingredient("onion,raw")

    You can also calculate the nutrient composition of several foods, like a simple yet delicious cabbage salad:

    ingredients <- c("CABBAGE,RAW", "MAYONNAISE,RED FAT,W/ OLIVE OIL", "ONIONS,RAW")
    grams <- c(100, 20, 10)

    calculate_nutrients(ingredients, grams) %>%
    select(Food = 1, Calories = 3, Protein = 4,
    Fat = 5, Carbs = 7) # Select only a few variables for looks and rename

    Dinner is served. I look forward to your feedback! And if anyone is up for it, this is a package that is just begging for cool visualizations of nutrient composition, along with a Shiny overlay!

    To leave a comment for the author, please follow the link and comment on their blog: Renglish – 56north | Skræddersyet dataanalyse.

    Categories: Methodology Blogs