R-bloggers

R news and tutorials contributed by 750 R bloggers

Studying CRAN package names

Tue, 2017-05-09 03:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

Setting a name for a CRAN package is an intimate process. Out of an
infinite range of possibilities, an idea comes for a package, and you
spend at least a couple of days writing and testing your code before
submitting it to CRAN. Once you set the name of the package, you cannot
change it. Your choice indexes your effort, and it shouldn't be a
surprise that the name of a package can improve its impact.

Looking at package names, one strategy that I commonly observe is to
use a short word, a verb or noun, and add the letter R to it. A good
example is dplyr. The letter d stands for dataframe, ply is just a
tool, and R is, well, you know. In a conventional sense, the name of
this popular tool is informative and easy to remember. As always, the
extremes are never good. A couple of bad examples of package naming are
A3, AF, BB and so on. Googling these package names is definitely not
helpful. On the other end, package samplesizelogisticcasecontrol
provides a lot of information, but it is plainly unattractive!

Another strategy that I also find interesting is developers using
names that, at first sight, are completely unrelated to the purpose of
the package, but where there is a not-so-obvious link. One example is
package sandwich. At first sight, I challenge anyone to figure out what
it does. This is an econometrics package that computes robust standard
errors in a regression model. These robust estimates are also called
sandwich estimators because the formula looks like a sandwich. But you
only know that if you have studied a bit of econometric theory. This
strategy works because it is easier to remember things that surprise
us. Another great example is package janitor. I'm sure you already
suspect that it has something to do with data cleaning. And you are
right! The message of the name is effortless and it works! The author
even got the privilege of using the letter R in the name.

While I can always hand-pick good and bad examples, let's dig deeper.
In this post, we will study the names of packages available on CRAN by
comparing them to the whole English vocabulary. We are going to use the
following datasets:

  • The list of CRAN packages, available with function
    available.packages().
  • The list of English words, available from the WordNet database.
    This is a comprehensive database of English words that I once used
    in a paper. It contains several tables, including all possible
    words from the English language.

First, let’s have a look at the distribution of size (number of
characters) for all packages available in CRAN:

library(dplyr)
library(ggplot2)

# get data
df.pkgs <- as.data.frame(available.packages(repos = 'https://cloud.r-project.org/')) %>%
  mutate(Package = as.character(Package),
         n.char = nchar(Package)) %>%
  rename(pkg = Package) %>%
  select(pkg, n.char)

# plot it!
p <- ggplot(df.pkgs, aes(x = n.char)) +
  geom_histogram(binwidth = 1)
print(p)

As I suspected, the names of CRAN packages are usually short, with an
average of 5-6 characters. A few packages have more than 25 characters.
Let's see their names:

df.pkgs$pkg[df.pkgs$n.char > 25]
## [1] "AnglerCreelSurveySimulation"   "FractalParameterEstimation"
## [3] "ig.vancouver.2014.topcolour"   "RoughSetKnowledgeReduction"
## [5] "samplesizelogisticcasecontrol"

I am sorry for the authors but, in my opinion, we could find better
names. I am also sorry for those who use these packages but do not use
RStudio's autocomplete tool and need to type the loooooooooong names.

As for my hypothesis that CRAN packages have short names, let's
compare the distribution of package name lengths against all words in
the English language. For that, let's load the WordNet database and do
some calculations:

library(RSQLite)
library(stringr)

# get data
conn <- dbConnect(drv = SQLite(), 'WordNet/sqlite-31.db')
words <- dbReadTable(conn, 'wordsXsensesXsynsets') %>%
  select(lemma)

# some are duplicates (same word, different types)
words <- unique(words)
words$nchar <- nchar(words$lemma)

# set df to plot
df.to.plot <- data.frame(n.char = c(df.pkgs$n.char, words$nchar),
                         source.char = c(rep('CRAN pkgs', nrow(df.pkgs)),
                                         rep('English Vocabulary', nrow(words))))

p <- ggplot(df.to.plot, aes(x = n.char, color = source.char)) +
  geom_density(size = 1) +
  coord_cartesian(xlim = c(0, 40))
print(p)

As I suspected, the distributions are very different. There is no need
to apply a formal test, as the visual evidence is clear: CRAN packages
tend to have shorter names.
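
As a quick aside (my addition, not part of the original post), a
two-sample Kolmogorov-Smirnov test on the two length vectors built
above would give a formal version of the same comparison:

# not in the original post: formal two-sample test of the two length distributions
# (ties in the integer lengths will trigger a warning; the conclusion is unchanged)
ks.test(df.pkgs$n.char, words$nchar)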

Now, let’s look at the distribution of used letters in relative terms:

library(scales)

temp <- str_split(str_to_upper(df.pkgs$pkg), '')
all.chars <- do.call(what = c, args = temp)
char.counts.pkg <- table(all.chars)

temp <- str_split(str_to_upper(words$lemma), '')
all.chars <- do.call(what = c, args = temp)
char.counts.words <- table(all.chars)

df.to.plot <- data.frame(perc.count = c(char.counts.pkg/sum(char.counts.pkg),
                                        char.counts.words/sum(char.counts.words)),
                         char = c(names(char.counts.pkg), names(char.counts.words)),
                         source.char = c(rep('CRAN pkgs', length(char.counts.pkg)),
                                         rep('WordNet', length(char.counts.words))))

# only keep LETTERS
idx <- df.to.plot$char %in% LETTERS
df.to.plot <- df.to.plot[idx, ]

p <- ggplot(df.to.plot, aes(x = char, y = perc.count, color = source.char, width = .5)) +
  geom_col(position = 'dodge') +
  scale_y_continuous(labels = percent_format())
print(p)

The result is really interesting! I was expecting far more differences
in the relative use of characters. Not surprisingly, the letter R is
used more in package naming than in the English vocabulary. Still, the
difference is not that large. Given that R is the name of the
programming language, I was expecting a much greater proportion of R
characters. My intuition was clearly wrong! In comparison, letters P
and M show a larger difference in relative terms than letter R. I'm
really not sure why that is. Overall, it is pretty clear that the use
of characters in the names of packages follows the distribution of
words in the English language.
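
As a small sketch of my own (not in the original post), the per-letter
differences behind this claim can be ranked directly from the
df.to.plot data frame built above:

library(tidyr)

# rank letters by the absolute difference in relative frequency between the two sources
df.to.plot %>%
  spread(source.char, perc.count) %>%
  mutate(diff = `CRAN pkgs` - WordNet) %>%
  arrange(desc(abs(diff))) %>%
  head()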

While the distribution of letters is similar, only a few packages have
names that exactly match an English word. Of all 10524 packages found
in CRAN, only 698 are an exact match to one of the 147478 unique words
in the English vocabulary. If we can't match them all, let's see how
far they are from the English dictionary. For that, we use package
stringdist to compute the minimum edit distance between each package
name and the English vocabulary. In a nutshell, the edit distance
measures how many string modifications we need in order for two
strings to match each other. By computing the minimum edit distance of
each package name against the English vocabulary, we get a measure of
similarity. Here I'm using method='lv', which seems to be the most
appropriate for this study.
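
The matching step itself is not shown in the post; a minimal sketch
(my addition, reusing the df.pkgs and words objects created above)
would be:

# count package names that are an exact, case-insensitive match to an English word
exact.match <- str_to_lower(df.pkgs$pkg) %in% str_to_lower(words$lemma)
sum(exact.match)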

my.fct <- function(str.in, possible.names){
  require(stringdist)
  #my.dist <- possible.names[which.min(stringdist(str.in, possible.names))]
  my.dist <- min(stringdist(str.in, possible.names, method='lv'))
  #my.dist <- min(adist(str.in, possible.names))
  return(my.dist)
}

char.distances <- pbapply::pbsapply(df.pkgs$pkg,
                                    FUN = my.fct,
                                    possible.names = words$lemma)

## Loading required package: stringdist

Let’s look at the results:

p <- ggplot(data.frame(char.distances), aes(x = char.distances)) +
  geom_histogram(binwidth = 1)
print(p)

As we can see, most package names are just three or four edits away
from an English word. This shows how similar package names are to the
English vocabulary.

Summing up, our data analysis shows that the names of packages are
usually shorter than words in the English language. However, looking
at the distribution of characters used and at the edit distances, it
is pretty clear that package names are based on the English language,
usually with a few modifications of a base word.

I hope you enjoyed this post. In the next one I will explore package
authors and the use of comments in R code.


Travis-CI Flaw Exposed Some ‘Secure’ Environment Variable Contents

Mon, 2017-05-08 23:04

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Tagging this as #rstats-related since many R coders use Travis-CI to automate package builds (and other things). Security researcher Ivan Vyshnevskyi did some ++gd responsible disclosure to the Travis-CI folks letting them know they were leaking the contents of “secure” environment variables in the build logs.

The TL;DR on "secure" environment variables is that they let you store secrets — such as OAuth keys or API tokens — ostensibly "securely" (they have to be decrypted to be used, so someone or something has the keys to do that, so it's not really "secure"). That is, they should not leak them in build logs. Except that they did… for a bit.

As mentioned, this flaw was reported and is now fixed. Regen your “secrets” and keep an eye on Travis security announcements moving forward.


Video Introduction to Bayesian Data Analysis, Part 3: How to do Bayes?

Mon, 2017-05-08 17:50

(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)


This is the last video of a three-part introduction to Bayesian data analysis aimed at those who aren't necessarily that well-versed in probability theory but who do know a little bit of programming. If you haven't watched the other parts yet, I really recommend you do that first: Part 1 & Part 2.

This third video covers the how? of Bayesian data analysis: how to do it efficiently and how to do it in practice. But "covers" is really a big word; "briefly introduces" is more appropriate. Along the way I will briefly introduce Markov chain Monte Carlo, parameter spaces and the computational framework Stan:



After having watched this video tutorial (and having done the exercises) you won’t be an expert in any way, like the expert fisherman depicted below by the master Hokusai. But, at least, you will have caught your own first fish!


Jobs for R users – from all over the world (2017-05-08)

Mon, 2017-05-08 16:47
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Full-Time
    Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc
    Chicago Illinois, United States
    5 May 2017
  2. Full-Time
    Transportation Market Research Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    4 May 2017
  3. Full-Time
    Data Manager @ Los Angeles, California, U.S. Virtual Pediatric Systems, LLC – Posted by gsotocampos
    Los Angeles California, United States
    3 May 2017
  4. Full-Time
    Sr. Manager, Analytics @ Minneapolis, Minnesota, U.S. Korn Ferry – Posted by jone1087
    Minneapolis Minnesota, United States
    3 May 2017
  5. Full-Time
    Health Services Data Science Associate (ZIM1701.17) @ Palo Alto, California Palo Alto Veterans Institute for Research – Posted by PAVIR
    Palo Alto California, United States
    26 Apr 2017
  6. Full-Time
    R Data Visualization Intern (paid) @ Washington, United States Federal Reserve Board – Posted by brittneymolloy
    Washington District of Columbia, United States
    20 Apr 2017
  7. Full-Time
    Data Scientist for ADM @ Reno, Nevada, United States ADM Associates, Inc. – Posted by Elaine
    Reno Nevada, United States
    20 Apr 2017
  8. Full-Time
    Data Scientist for Parkbob @ Wien, Austria Parkbob – Posted by ivan.kasanicky
    Wien Wien, Austria
    20 Apr 2017

On R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Machine Learning Pipelines for R

Mon, 2017-05-08 13:00

(This article was first published on R – When Localhost Isn't Enough, and kindly contributed to R-bloggers)

Building machine learning and statistical models often requires pre- and post-transformation of the input and/or response variables, prior to training (or fitting) the models. For example, a model may require training on the logarithm of the response and input variables. As a consequence, fitting and then generating predictions from these models requires repeated application of transformation and inverse-transformation functions – to go from the domain of the original input variables to the domain of the original output variables (via the model). This is usually quite a laborious and repetitive process that leads to messy code and notebooks.

The pipeliner package aims to provide an elegant solution to these issues by implementing a common interface and workflow with which it is possible to:

  • define transformation and inverse-transformation functions;
  • fit a model on training data; and then,
  • generate a prediction (or model-scoring) function that automatically applies the entire pipeline of transformations and inverse-transformations to the inputs and outputs of the inner-model and its predicted values (or scores).

The idea of pipelines is inspired by the machine learning pipelines implemented in Apache Spark's MLlib library (which are in turn inspired by Python's scikit-learn package). This package is still in its infancy and the latest development version can be downloaded from this GitHub repository using the devtools package (bundled with RStudio),

devtools::install_github("alexioannides/pipeliner")

Pipes in the Pipeline

There are currently four types of pipeline section – a section being a function that wraps a user-defined function – that can be assembled into a pipeline:

  • transform_features: wraps a function that maps input variables (or features) to another space – e.g.,
transform_features(function(df) { data.frame(x1 = log(df$var1)) })
  • transform_response: wraps a function that maps the response variable to another space – e.g.,
transform_response(function(df) { data.frame(y = log(df$response)) })
  • estimate_model: wraps a function that defines how to estimate a model from training data in a data.frame – e.g.,
estimate_model(function(df) { lm(y ~ 1 + x1, df) })
  • inv_transform_response: wraps a function that is the inverse of transform_response, such that we can map from the space of inner-model predictions to the space of output-domain predictions – e.g.,
inv_transform_response(function(df) { data.frame(pred_response = exp(df$pred_y)) })

As demonstrated above, each one of these functions expects as its argument another unary function of a data.frame (i.e. it has to be a function of a single data.frame). With the exception of estimate_model, which expects the input function to return an object that has a predict.object-class-name method existing in the current environment (e.g. predict.lm for linear models built using lm()), all the other transform functions also expect their input functions to return data.frames (consisting entirely of columns not present in the input data.frame). If any of these rules are violated then appropriately named errors will be thrown to help you locate the issue.
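
To make the estimate_model rule concrete, here is a toy sketch of my own (a generic S3 illustration of the rule described above, not code from the package documentation): any function returning an object whose class has a predict method in scope will do.

# toy 'predict the mean' model constructor: returns an object of class "meanModel"
meanModel <- function(df) {
  structure(list(mean_y = mean(df$y)), class = "meanModel")
}

# the matching S3 predict method, available in the current environment
predict.meanModel <- function(object, newdata, ...) {
  rep(object$mean_y, nrow(newdata))
}

# ...which could then be wrapped as, e.g., estimate_model(function(df) { meanModel(df) })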

If this sounds complex and convoluted then I encourage you to skip to the examples below – this framework is very simple to use in practice. Simplicity is the key aim here.

Two Interfaces to Rule Them All

I am a great believer in and advocate of functional programming – especially for data-related tasks like building machine learning models. At the same time, the notion of a 'machine learning pipeline' is well represented by a simple object-oriented class hierarchy (which is how it is implemented in Apache Spark). I couldn't decide which style of interface was best, so I implemented both within pipeliner (using the same underlying code) and ensured their output can be used interchangeably. To keep this introduction simple, however, I'm only going to talk about the functional interface – those interested in the (more) object-oriented approach are encouraged to read the manual pages for the ml_pipeline_builder 'class'.

Example Usage with a Functional Flavor

We use the faithful dataset shipped with R, together with the pipeliner package to estimate a linear regression model for the eruption duration of ‘Old Faithful’ as a function of the inter-eruption waiting time. The transformations we apply to the input and response variables – before we estimate the model – are simple scaling by the mean and standard deviation (i.e. mapping the variables to z-scores).

The end-to-end process for building the pipeline, estimating the model and generating in-sample predictions (that include all interim variable transformations), is as follows,

library(pipeliner)

data <- faithful

lm_pipeline <- pipeline(
  data,
  transform_features(function(df) {
    data.frame(x1 = (df$waiting - mean(df$waiting)) / sd(df$waiting))
  }),

  transform_response(function(df) {
    data.frame(y = (df$eruptions - mean(df$eruptions)) / sd(df$eruptions))
  }),

  estimate_model(function(df) {
    lm(y ~ 1 + x1, df)
  }),

  inv_transform_response(function(df) {
    data.frame(pred_eruptions = df$pred_model * sd(df$eruptions) + mean(df$eruptions))
  })
)

in_sample_predictions <- predict(lm_pipeline, data, verbose = TRUE)
head(in_sample_predictions)

##   eruptions waiting         x1 pred_model pred_eruptions
## 1     3.600      79  0.5960248  0.5369058       4.100592
## 2     1.800      54 -1.2428901 -1.1196093       2.209893
## 3     3.333      74  0.2282418  0.2056028       3.722452
## 4     2.283      62 -0.6544374 -0.5895245       2.814917
## 5     4.533      85  1.0373644  0.9344694       4.554360
## 6     2.883      55 -1.1693335 -1.0533487       2.285521

Accessing Inner Models and Prediction Functions

We can access the estimated inner models directly and compute summaries, etc – for example,

summary(lm_pipeline$inner_model)

##
## Call:
## lm(formula = y ~ 1 + x1, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.13826 -0.33021  0.03074  0.30586  1.04549
##
## Residual standard error: 0.435 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16

Pipeline prediction functions can also be accessed directly in a similar way – for example,

pred_function <- lm_pipeline$predict
predictions <- pred_function(data, verbose = FALSE)
head(predictions)

##   pred_eruptions
## 1       4.100592
## 2       2.209893
## 3       3.722452
## 4       2.814917
## 5       4.554360
## 6       2.285521

Turbo-Charged Pipelines in the Tidyverse

The pipeliner approach to building models becomes even more concise when combined with the set of packages in the tidyverse (http://tidyverse.org). For example, the 'Old Faithful' pipeline could be rewritten as,

library(tidyverse)

lm_pipeline <- data %>%
  pipeline(
    transform_features(function(df) {
      transmute(df, x1 = (waiting - mean(waiting)) / sd(waiting))
    }),

    transform_response(function(df) {
      transmute(df, y = (eruptions - mean(eruptions)) / sd(eruptions))
    }),

    estimate_model(function(df) {
      lm(y ~ 1 + x1, df)
    }),

    inv_transform_response(function(df) {
      transmute(df, pred_eruptions = pred_model * sd(eruptions) + mean(eruptions))
    })
  )

head(predict(lm_pipeline, data))

## [1] 4.100592 2.209893 3.722452 2.814917 4.554360 2.285521

Nice, compact and expressive (if I don’t say so myself)!

Compact Cross-validation

If we now introduce the modelr package into this workflow and adopt the list-columns pattern described in Hadley Wickham's R for Data Science, we can also achieve wonderfully compact end-to-end model estimation and cross-validation,

library(modelr)

# define a function that estimates a machine learning pipeline on a single fold of the data
pipeline_func <- function(df) {
  pipeline(
    df,
    transform_features(function(df) {
      transmute(df, x1 = (waiting - mean(waiting)) / sd(waiting))
    }),

    transform_response(function(df) {
      transmute(df, y = (eruptions - mean(eruptions)) / sd(eruptions))
    }),

    estimate_model(function(df) {
      lm(y ~ 1 + x1, df)
    }),

    inv_transform_response(function(df) {
      transmute(df, pred_eruptions = pred_model * sd(eruptions) + mean(eruptions))
    })
  )
}

# 5-fold cross-validation using machine learning pipelines
cv_rmse <- crossv_kfold(data, 5) %>%
  mutate(model = map(train, ~ pipeline_func(as.data.frame(.x))),
         predictions = map2(model, test, ~ predict(.x, as.data.frame(.y))),
         residuals = map2(predictions, test, ~ .x - as.data.frame(.y)$eruptions),
         rmse = map_dbl(residuals, ~ sqrt(mean(.x ^ 2)))) %>%
  summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse))

cv_rmse

## # A tibble: 1 × 2
##   mean_rmse    sd_rmse
##
## 1 0.4877222 0.05314748

Forthcoming Attractions

I built pipeliner largely to fill a hole in my own workflows. Up until now I've used Max Kuhn's excellent caret package quite a bit, but for in-the-moment model building (e.g. within an R Notebook) it wasn't simplifying the code that much, and the style doesn't quite fit with the tidy and functional world that I now inhabit most of the time. So, I plugged the hole by myself. I intend to live with pipeliner for a while to get an idea of where it might go next, but I am always open to suggestions (and bug notifications) – please leave any ideas here.


From Points to (Messy) Lines

Mon, 2017-05-08 12:05

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

A week or so ago, I came up with a new chart type – race concordance charts – for looking at a motor circuit race from the on-track perspective of a particular driver. Here are a couple of examples from the 2017 F1 Grand Prix:

The gap is the time to the car on track ahead (negative gap, to the left) or behind (to the right). The colour indicates whether the car is on the same lap (light blue),  on the lap behind (orange to red), or a lap ahead (dark blue).

In the dots, we can "see" lines relating to the relative progress of particular cars. But what if we actually plot the progress of each of those other cars as a line? The colours represent different cars.

Here's another view of the track from Hulkenberg's perspective with a wider window, which by comparison with the previous chart suggests I need to better handle cars that do not drop off the track but do fall out of the display window… (At the moment, I only grab data for cars in the specified concordance window):

Note that we need to do a little bit of tidying up of the data so that we don’t connect lines for cars that flow off the left hand edge, for example, and then return several laps later from the right hand edge:

#Get the data for the cars, as before
inscope = sqldf(paste0('SELECT l1.code as code, l1.acctime-l2.acctime as acctimedelta, l2.lap-l1.lap as lapdelta, l2.lap as focuslap FROM lapTimes as l1 join lapTimes as l2 WHERE l1.acctime < (l2.acctime + ', abs(limits[2]), ') AND l1.acctime > (l2.acctime - ', abs(limits[1]), ') AND l2.code="', code, '";'))

#If consecutive rows for the same driver are more than one focuslap apart, break the line
inscope = ddply(inscope, .(code), transform, g = cumsum(c(0, diff(focuslap) > 1)))

#Continuous line segments have the same driver code and "group" number
g = ggplot(inscope)

#The interaction splits up the groups based on code and the contiguous focuslap group number
#We also need to ensure we plot acctimedelta relative to increasing focuslap
g = g + geom_line(aes(x = focuslap, y = acctimedelta, col = code, group = interaction(code, g)))

#...which means we then need to flip the axes
g = g + coord_flip()

There may still be some artefacts in the line plotting based on lapping… I can’t quite think this through at the moment:-(

So here’s my reading:

  • near-horizontal lines that go slightly up and to the right, and where a lot of places in the window are lost in a single lap, are the result of a pit stop by the car that lost the places; if we had access to pit information, we could perhaps dot these lines (see the sketch after this list)?
  • the "waist" in the chart for HUL shows cars coming together for a safety car, and then HUL losing pace to some cars whilst making advances on others;
  • lines with a constant gradient show a consistent gain or loss of time, per lap, over several laps;
  • a near vertical line shows a car keeping pace, and neither making nor losing time compared to the focus car.
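
Here is a hedged sketch of the pit-stop idea from the first bullet (my addition, not the post's code); it assumes a hypothetical logical column pit in inscope flagging laps on which a car pitted, and simply marks those laps on top of the existing lines:

# mark hypothetical pit-stop laps with an open point on top of each line
g = ggplot(inscope)
g = g + geom_line(aes(x = focuslap, y = acctimedelta, col = code, group = interaction(code, g)))
g = g + geom_point(data = subset(inscope, pit), aes(x = focuslap, y = acctimedelta, col = code), shape = 1)
g = g + coord_flip()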


Machine Learning. Regression Trees and Model Trees (Predicting Wine Quality)

Mon, 2017-05-08 11:00

(This article was first published on Data R Value, and kindly contributed to R-bloggers)

We will develop a forecasting example using the model tree and regression tree algorithms. The exercise was originally published in "Machine Learning with R" by Brett Lantz, PACKT Publishing 2015 (open source: community experience distilled).

The example we will develop is about predicting wine quality.

Image from Wine Folly

We will carry out the exercise verbatim as published in the aforementioned reference.

For more details on the model trees and regression trees algorithms it is recommended to check the aforementioned reference or any other bibliography of your choice.


### install and load required packages
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("RWeka")

library(rpart)
library(rpart.plot)
library(RWeka)

### read and explore the data
wine <- read.csv("whitewines.csv")
str(wine)

### examine the distribution of the outcome variable
hist(wine$quality, main = "Wine Quality", col = "red")

### examine output for outliers or other potential data problems
summary(wine)

### split the data into training and testing datasets
### 75% – 25%

wine_train <- wine[1:3750, ]
wine_test <- wine[3751:4898, ]

### training a model on the data

### training a regression tree model
m.rpart <- rpart(quality ~ ., data = wine_train)

### explore basic information about the tree
### for each node in the tree, the number of examples reaching
### the decision point is listed

m.rpart

### more detailed summary of tree’s fit

summary(m.rpart)

### visualizing the decision tree
rpart.plot(m.rpart, digits = 3)


rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE,
           type = 3, extra = 101)
 

### evaluating model performance
p.rpart <- predict(m.rpart, wine_test)

### model is not correctly identifying the extreme cases
summary(p.rpart)
summary(wine_test$quality)

### correlation between actual and predicted quality
cor(p.rpart, wine_test$quality)

### Measuring performance with the mean absolute error

MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

### the MAE for our predictions

MAE(p.rpart, wine_test$quality)

### the mean quality rating in the training data
mean(wine_train$quality)

### If we predicted the value 5.87 for every wine sample,
### we would have a mean absolute error of only about 0.67:

MAE(5.87, wine_test$quality)

### improving model performance?

### model tree
### model tree improves on regression trees by replacing the
### leaf nodes with regression models
### M5 algorithm (M5-prime)

m.m5p <- M5P(quality ~ ., data = wine_train)

### examine the tree and linear models

m.m5p
summary(m.m5p)

### predicted values and performance
p.m5p <- predict(m.m5p, wine_test)

summary(p.m5p)

cor(p.m5p, wine_test$quality)

MAE(wine_test$quality, p.m5p)


###
You can get the example and the dataset at:
https://github.com/pakinja/Data-R-Value


Installing Packages without Internet

Mon, 2017-05-08 09:19

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Graham Parsons

At Mango we’re often giving R training in locations where a reliable WiFi connection is not always guaranteed, so if we need trainees to download packages from CRAN it can be a show-stopper.

Here are a couple of code snippets that are useful to download packages from CRAN onto a USB stick when you have a good connection and then install them on site from the USB should you need to.

In the Office: Download the Dependencies

Knowing the packages we need is one thing, but knowing which packages they depend on is another, and knowing which packages those dependencies depend on is… well, not worth thinking about – there's a function that ships with R (in the tools package) to do it for us called package_dependencies().

Here’s a short example script that uses package_dependencies() to figure out the dependencies from the packages we want to use.

#' Get package dependencies
#'
#' @param packs A string vector of package names
#'
#' @return A string vector with packs plus the names of any dependencies
getDependencies <- function(packs){
  dependencyNames <- unlist(
    tools::package_dependencies(packages = packs, db = available.packages(),
                                which = c("Depends", "Imports"),
                                recursive = TRUE))
  packageNames <- union(packs, dependencyNames)
  packageNames
}

# Calculate dependencies
packages <- getDependencies(c("tidyverse", "mangoTraining"))

We can then download the right package type for the environment we’re going to be training. Often our customers are on Windows so we would download the “win.binary” type. We’re also going to save the package file names too so that we can install them by filename later.

# Download the packages to the working directory.
# Package names and filenames are returned in a matrix.
setwd("D:/my_usb/packages/")
pkgInfo <- download.packages(pkgs = packages, destdir = getwd(), type = "win.binary")

# Save just the package file names (basename() strips off the full paths leaving just the filename)
write.csv(file = "pkgFilenames.csv", basename(pkgInfo[, 2]), row.names = FALSE)

On Site: Install the Packages

Assuming we’ve downloaded our packages to a USB stick or similar, on site and without an internet connection we can now install the packages from disk.

# Set working directory to the location of the package files
setwd("D:/my_usb/packages/")

# Read the package filenames and install
pkgFilenames <- read.csv("pkgFilenames.csv", stringsAsFactors = FALSE)[, 1]
install.packages(pkgFilenames, repos = NULL, type = "win.binary")

That’s it! If you want to know more, the code for this post can be found on GitHub.
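
As a hedged aside (not part of the original post), the same two-step approach should carry over to Linux or macOS training machines by swapping the Windows binaries for source packages:

# assumption: the same 'packages' vector and 'pkgFilenames.csv' workflow as above,
# but downloading and installing source builds instead of Windows binaries
pkgInfo <- download.packages(pkgs = packages, destdir = getwd(), type = "source")
install.packages(pkgFilenames, repos = NULL, type = "source")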



Graphical Presentation of Missing Data; VIM Package

Mon, 2017-05-08 08:16

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Missing data is a problem that challenges data analysis both methodologically and computationally in medical research. Patients in clinical trials and cohort studies may drop out of the study and therefore generate missing data. The missing data can be at random when the participants who drop out of the study are not different from those who remain in the study. For example, in a study of body mass index and cholesterol levels, participants who don't measure their blood cholesterol have a body mass index comparable to participants who do measure their blood cholesterol.

To handle missing data, researchers often choose to conduct the analysis among participants without missing data (i.e., complete case analysis), but sometimes they prefer to impute the data. In previous tutorials (Tutorial 1, Tutorial 2) published at DataScience+ we have shown how to impute missing data using the MICE package. In this tutorial I will show how to graphically present the missing data, with only one purpose: to find out whether the missingness is at random. To do so, we will build plots using the function marginplot from the VIM package. This is a short "how to" tutorial and has no intention of explaining the types of missing data. You can learn more about types of missing data, such as missing completely at random, missing at random, and not missing at random, from the book Statistical Analysis with Missing Data.

Data Preparation

I simulated a database with 250 observations for illustration purposes; it has no clinical relevance. There are seven variables: Age, Gender, Cholesterol, SystolicBP, BMI, Smoking, and Education.

Load the libraries and get the data by running the script below:

library(VIM)
library(mice)
library(dplyr)
library(tibble)

dat <- read.csv(url("https://goo.gl/4DYzru"), header=TRUE, sep=",")
head(dat)

##    Age Gender Cholesterol SystolicBP  BMI Smoking Education
## 1 67.9 Female       236.4      129.8 26.4     Yes      High
## 2 54.8 Female       256.3      133.4 28.4      No    Medium
## 3 68.4   Male       198.7      158.5 24.1     Yes      High
## 4 67.9   Male       205.0      136.0 19.9      No       Low
## 5 60.9   Male       207.7      145.4 26.7      No    Medium
## 6 44.9 Female       222.5      130.6 30.6      No       Low

In this database there are no missing values. I will introduce missingness not at random for the cholesterol variable. As shown in the code below, missing values for cholesterol will occur only among participants with a body mass index of 30 or greater (i.e., participants with obesity).

set.seed(10)
missing = rbinom(250, 1, 0.3)
dat$Cholesterol = with(dat, ifelse(BMI>=30 & missing==1, NA, Cholesterol))

sum(is.na(dat$Cholesterol))
[1] 16

We have 16 participants with missing values in the cholesterol variable. I am going to impute the missing values using the MICE package and the PMM (predictive mean matching) method.

init = mice(dat, maxit=0)
meth = init$method
predM = init$predictorMatrix
meth[c("Cholesterol")] = "pmm"

set.seed(101)
imputed = mice(dat, method=meth, predictorMatrix=predM, m=1)
imp = complete(imputed)

Next, I will create a database with the imputed data and an indicator variable which shows which observations are imputed. This is necessary for plotting with the marginplot function.

dt1 = dat %>%
  select(Cholesterol, BMI) %>%
  rename(Cholesterol_imp = Cholesterol) %>%
  mutate(Cholesterol_imp = as.logical(ifelse(is.na(Cholesterol_imp), "TRUE", "FALSE"))) %>%
  rownames_to_column()

dt2 = imp %>%
  select(Cholesterol, BMI) %>%
  rownames_to_column()

dt = left_join(dt1, dt2)
head(dt)

  rowname Cholesterol_imp  BMI Cholesterol
1       1           FALSE 26.4       236.4
2       2           FALSE 28.4       256.3
3       3           FALSE 24.1       198.7
4       4           FALSE 19.9       205.0
5       5           FALSE 26.7       207.7
6       6           FALSE 30.6       222.5

Graphical presentation of missing data

Now that we have a database with the imputed variables and the corresponding indicator of whether each observation is imputed, we will plot it using the function marginplot from the VIM package.

vars <- c("BMI", "Cholesterol", "Cholesterol_imp")
marginplot(dt[,vars], delimiter="_imp", alpha=0.6, pch=c(19))

This is the output of the code above:

The blue color in the scatterplot above indicates the non-missing values for cholesterol, and the orange shows the missing data which were imputed. As we can see, participants with missing data in cholesterol have a higher body mass index than participants without missing data, which indicates that the missingness is not at random. This scatterplot also shows boxplots which illustrate the distribution of the data. For example, the median body mass index for participants with missing data is around 33, whereas for those without missing data it is around 26.

Missingness at random would look like the plot below:
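
The post shows that comparison as an image only; here is a minimal sketch of my own (assuming the same dat, mice and marginplot workflow used above) that produces a comparable plot by introducing the missing cholesterol values completely at random instead of conditioning on BMI:

# start from the complete data again and delete 16 cholesterol values completely at random
set.seed(20)
dat_mcar <- read.csv(url("https://goo.gl/4DYzru"), header=TRUE, sep=",")
dat_mcar$Cholesterol[sample(nrow(dat_mcar), 16)] <- NA

# impute with pmm, as before
init_mcar <- mice(dat_mcar, maxit=0)
meth_mcar <- init_mcar$method
meth_mcar[c("Cholesterol")] <- "pmm"
imp_mcar <- complete(mice(dat_mcar, method=meth_mcar, m=1, printFlag=FALSE))

# rebuild the plotting data with the imputation indicator and draw the same marginplot
dt_mcar <- data.frame(BMI = imp_mcar$BMI,
                      Cholesterol = imp_mcar$Cholesterol,
                      Cholesterol_imp = is.na(dat_mcar$Cholesterol))
marginplot(dt_mcar[, c("BMI","Cholesterol","Cholesterol_imp")],
           delimiter="_imp", alpha=0.6, pch=c(19))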

Conclusion

In this post, we showed how to use the marginplot function of the VIM package to identify whether your missing data are at random. To learn more about the VIM package I suggest reading the paper by Bernd Prantner.

I hope you find this post useful. Leave comments below if you have questions.

Related Post

    1. Creating Graphs with Python and GooPyCharts
    2. Visualizing Tennis Grand Slam Winners Performances
    3. Plotting Data Online via Plotly and Python
    4. Using PostgreSQL and shiny with a dynamic leaflet map: monitoring trash cans
    5. Visualizing Streaming Data And Alert Notification with Shiny


Trading Strategy: 52-Weeks High Effect in Stocks

Mon, 2017-05-08 07:12

By Milind Paradkar

In today's algorithmic trading, having a trading edge is one of the most critical elements. It's plain simple: if you don't have an edge, don't trade! Hence, as a quant, one is always on the lookout for good trading ideas. One of the good resources for trading strategies that has been gaining wide popularity is the Quantpedia site. Quantpedia has thousands of financial research papers that can be utilized to create profitable trading strategies.

The "Screener" page on Quantpedia categorizes hundreds of trading strategies based on different parameters like Period, Instruments, Markets, Complexity, Performance, Drawdown, Volatility, Sharpe ratio, etc.

Quantpedia has made some of these trading strategies available for free to its users. In this article, we will explore one such trading strategy listed on their site, called the "52-Weeks High Effect in Stocks".

52-Weeks High Effect in Stocks

(http://quantpedia.com/Screener/Details/18)

The Quantpedia page for this trading strategy provides a detailed description which includes the 52-weeks high effect explanation, source research paper, other related papers, a visualization of the strategy performance and also other related trading strategies.

What is the 52-Weeks High Effect?

Let us put down the lucid explanation provided on Quantpedia here –

The "52-week high effect" states that stocks with prices close to the 52-week highs have better subsequent returns than stocks with prices far from the 52-week highs. Investors use the 52-week high as an "anchor" against which they value stocks. When stock prices are near the 52-week high, investors are unwilling to bid the price all the way to the fundamental value. As a result, investors under-react when stock prices approach the 52-week high, and this creates the 52-week high effect.

Source Paper

(http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1787378)

The source paper, "Industry Information and the 52-Week High Effect", was authored by Xin Hong, Bradford D. Jordan, and Mark H. Liu.

The paper says that traders use the 52-week high as a reference point against which they evaluate the potential impact of news. When good news has pushed a stock's price near or to a new 52-week high, traders are reluctant to bid the price of the stock higher even if the information warrants it. The information eventually prevails and the price moves up, resulting in a continuation. It works similarly for 52-week lows.

The trading strategy developed by the authors buys stocks in industries in which stock prices are close to 52-week highs and shorts stocks in industries in which stock prices are far from 52-week highs. They found that the industry 52-week high trading strategy is more profitable than the individual 52-week high trading strategy proposed by George and Hwang (2004).

Framing our 52-Weeks High Effect Strategy using R programming

Having understood the 52-weeks high effect, we will try to backtest a simple trading strategy using R programming. Please note that we are not trying to replicate the exact trading strategy developed by the authors in their research paper.

We test our trading strategy for a 3-year backtest period using daily data on around 140 stocks listed on the National Stock Exchange of India Ltd. (NSE).

Brief about the strategy – The trading strategy reads the daily historical data for each stock in the list and checks if the price of the stock is near its 52-week high at the start of each month. We have shown how to check for this condition in step 4 of the trading strategy formulation process illustrated below. For all the stocks that pass this condition, we form an equally weighted portfolio for that month. We take a long position in these stocks at the start of the month and square off our position at the start of the next month. We follow this process for every month of our backtest period. Finally, we compute and chart the performance metrics of our trading strategy.

Now, let us understand the process of trading strategy formulation in a step-by-step manner. For reference, we have posted the R code snippets of the relevant sections of the trading strategy under their respective steps.

Step 1: First, we set the backtest period, and the upper and lower threshold values for determining whether a stock is near its 52-week high.

# Setting the lower and upper threshold limits
lower_threshold_limit = 0.90  # (eg. 0.90 = 90%)
upper_threshold_limit = 0.95  # (eg. 0.95 = 95%)

# Backtesting period (Eg. 1 = 1 year) minimum period selected should be 2 years.
noDays = 4

Step 2: In this step, we read the historical stock data using the read.csv function from R. We are using data from Google Finance, and it consists of the Open/High/Low/Close (OHLC) & Volume values.

# Run the program for each stock in the list
for(s in 1:length(symbol)){

   print(s)
   dirPath = paste(getwd(),"/4 Year Historical Data/",sep="")
   fileName = paste(dirPath,symbol[s],".csv",sep="")
   data = as.data.frame(read.csv(fileName))
   data$TICKER = symbol[s]

   # Merge NIFTY prices with Stock data and select the Close Price
   data = merge(data,data_nifty, by = "DATE")
   data = data[, c("DATE", "TICKER","CLOSE.x","CLOSE.y")]
   colnames(data) = c("DATE","TICKER","CLOSE","NIFTY")
   N = nrow(data)

Step 3: Since we are using daily data, we need to determine the start date of each month. The start date need not necessarily be the 1st of every month because the 1st can be a weekend or a holiday for the stock exchange. Hence, we write R code which will determine the first date of each month.

# Determine the date on which each month of the backtest period starts
data$First_Day = ""
day = format(ymd(data$DATE),format="%d")
monthYr = format(ymd(data$DATE),format="%Y-%m")
yt = tapply(day,monthYr, min)

first_day = as.Date(paste(row.names(yt),yt,sep="-"))
frows = match(first_day, ymd(data$DATE))
for (f in frows) {data$First_Day[f] = "First Day of the Month"}

data = data[, c("TICKER","DATE", "CLOSE","NIFTY","First_Day")]

Step 4: Check if the stock is near the 52-week high mark. In this part, we first compute the 52-week high price for each stock. We then compute the upper and the lower thresholds using the 52-week high price.

Example:

If the lower threshold = 0.90, upper threshold = 0.95 and the 52-week high = 1200, then the threshold range is given by:

Threshold range = (0.90 * 1200) to (0.95 * 1200)

Threshold range = 1080 to 1140

If the stock price at the start of the month falls in this range, we consider the stock to be near its 52-week high mark. We have also included one additional condition in this step. This condition checks whether the stock price in the past 30 days had reached the current 52-week high price and whether it is within the threshold range now. Such a stock will not be included in our portfolio, as we assume that the stock price is in decline after reaching today's 52-week high price.

# Check if the stock is near its 52-week high at the start of each month
data$Near_52_Week_High = ""
data$Max_52 = numeric(nrow(data))
data$Max_Not = numeric(nrow(data))

frows_tp = frows[frows >= 260]
for (fr in frows_tp){

   # This will determine the max price in the last 1 year (252 trading days)
   data$Max_52[fr] = max(data$CLOSE[(fr-252):(fr-1)])

   # This will check whether the max price has occurred in the last "x" days.
   data$Max_Not[fr] = max(data$CLOSE[(fr-no_max):(fr-1)])

   if ((data$CLOSE[fr] >= lower_threshold_limit * data$Max_52[fr])
       & (data$CLOSE[fr] <= upper_threshold_limit * data$Max_52[fr])
       & (data$Max_Not[fr] != data$Max_52[fr]) == TRUE ){
      data$Near_52_Week_High[fr] = "Near 52-Week High"
   } else {
      data$Near_52_Week_High[fr] = "Not Near 52-Week High"
   }
}

Step 5: For all the stocks that fulfill the criteria mentioned in the step above, we create a long-only portfolio. The entry price equals the price at the start of the month. We square off our long position at the start of the next month. We consider the close price of the stock for our entry and exit trades.

# Enter into a long position for stocks at each start of month
data = subset(data,
              select=c(TICKER,DATE,CLOSE,NIFTY,First_Day,Max_52,Near_52_Week_High),
              subset=(First_Day=="First Day of the Month"))

data$NEXT_CLOSE = lagpad(data$CLOSE, 1)
colnames(data) = c("TICKER","DATE","CLOSE","NIFTY","First_Day","Max_52",
                   "Near_52_Week_High","NEXT_CLOSE")

data$Profit_Loss = numeric(nrow(data))
data$Nifty_change = numeric(nrow(data))

for (i in 1:length(data$CLOSE)) {
   if ((data$Near_52_Week_High[i] == "Near 52-Week High") == TRUE){
      data$Profit_Loss[i] = round(data$CLOSE[i+1] - data$CLOSE[i],2)
      data$Nifty_change[i] = round(Delt(data$NIFTY[i],data$NIFTY[i+1])*100,2)
   }
}

for (i in 1:length(data$CLOSE)) {
   if ((data$Near_52_Week_High[i] == "Not Near 52-Week High") == TRUE){
      data$Profit_Loss[i] = 0
      data$Nifty_change[i] = round(Delt(data$NIFTY[i],data$NIFTY[i+1])*100,2)
   }
}

Step 6: In this step, we write R code which creates a summary sheet of all the trades for each month in the backtest period. A sample summary sheet is shown below. It also includes the Profit/Loss from every trade undertaken during the month.

# Create a Summary worksheet for all the trades during a particular month
final_data = final_data[-1,]
final_data = subset(final_data,
                    select=c(TICKER,DATE,CLOSE,NEXT_CLOSE,Max_52,
                             Near_52_Week_High,Profit_Loss,Nifty_change),
                    subset=(Near_52_Week_High == "Near 52-Week High"))

colnames(final_data) = c("Ticker","Date","Close_Price","Next_Close_Price",
                         "Max. 52-Week price","Is Stock near 52-Week high",
                         "Profit_Loss","Nifty_Change")

merged_file = paste(date_values[a],"- Summary.csv")
write.csv(final_data,merged_file)

Step 7: In the final step, we compute the portfolio performance over the entire backtest period and also plot the equity curve using the PerformanceAnalytics package in R. The portfolio performance is saved in a CSV file.

cum_returns = Return.cumulative(eq_ts, geometric = TRUE)
print(cum_returns)

charts.PerformanceSummary(eq_ts, geometric=TRUE, wealth.index = FALSE)

print(SharpeRatio.annualized(eq_ts, Rf = 0, scale = 12, geometric = TRUE))

A sample summary of the portfolio performance is shown below. In this case, the input parameters to our trading strategy were as follows:

Plotting the Equity Curve

As can be observed from the equity curve, our trading strategy performed well during the initial period and then suffered drawdowns in the middle of the backtest period. The Sharpe ratio for the trading strategy comes to 0.4098.

Cumulative Return                1.172446
Annualized Sharpe Ratio (Rf=0%)  0.4098261

This was a simple trading strategy that we developed using the 52-week high effect explanation. One can tweak this trading strategy further to improve its performance and make it more robust, or try it out on different markets.

Next Step

You can explore other trading strategies listed on the Quantpedia site under their screener page and, if interested, you can sign up to get access to hundreds of exciting trading strategies.

If you want to learn about various aspects of algorithmic trading then check out our Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ is designed to equip you with the right skill sets to be a successful trader. Enroll now!

The post Trading Strategy: 52-Weeks High Effect in Stocks appeared first on .


ggplot2 style plotting in Python

Mon, 2017-05-08 05:14

(This article was first published on Environmental Science and Data Analytics, and kindly contributed to R-bloggers)

R is my language of choice for data science but a good data scientist should have some knowledge of all of the great tools available to them. Recently, I have been gleefully using Python for machine learning problems (specifically pandas and the wonderful scikit-learn). However, for all its greatness, I couldn't help but feel it lacks a bit in the data visualisation department. Don't get me wrong, matplotlib can be used to produce some very nice visualisations but I think the code is a bit messy and quite unintuitive when compared to Hadley Wickham's ggplot2.

I'm a huge fan of the ggplot2 package and was delighted to discover that there has been an attempt to replicate its style in Python via the ggplot package. I wanted to compare the two packages and see just how well ggplot matches up to ggplot2. Both packages contain built-in datasets and I will use the mtcars data to build a series of plots to see how they compare, both visually and syntactically.

Here we go….

Scatterplots

Scatterplots are great for bivariate profiling and revealing relationships between variables. A simple scatterplot using ggplot2 in R:

ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()

The same scatterplot using ggplot in Python:

ggplot(mtcars, aes(x = 'hp', y = 'mpg')) +\
  geom_point()

Not much of a difference there. The syntax is also very similar but in Python's ggplot, there is a \ after each + when adding a new layer to a plot. When mapping variables to the xy coordinates, the use of inverted commas is also required in ggplot.

Boxplots

Boxplots are very nice for visualising discrete variables and the distributions of variables across them. In R's ggplot2, I discretise the cyl variable with the factor() function to create a boxplot showing the distributions of mpg across each number of cylinders category (4, 6 and 8).

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

In Python, we need to discretise the cyl variable with the pandas.factorize() function before plotting with ggplot. The ordering is different in the Python plot output but reordering may be possible as it is in R. Also, note that the number of cylinders has been assigned dummy variables where 0 = 6 cylinders, 1 = 4 cylinders, and 2 = 8 cylinders.

mtcars['cyl'] = pd.factorize(mtcars.cyl)[0]

ggplot(mtcars, aes(x = 'cyl', y = 'mpg')) +\
  geom_boxplot()

Histograms

A fundamental tool for univariate profiling, histograms show the frequency distribution of a variable. In R's ggplot2, I plot the distribution of mpg across the mtcars data and add a few more components, such as margin outlines and red fill, while bins are set to ten and the x axis tick labels are modified.

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(colour = "black", fill = "red", bins = 10) +
  scale_x_continuous(breaks = seq(0, 40, 5))

With Python's ggplot, the histogram is not as tidy. I couldn't find a way to colour the margins black but there may be a way around this? The shape of the distribution looks a little different as well despite bins also being set to ten, but this is just down to how the binning is carried out in each language; the information within the plots is the same.

ggplot(mtcars, aes(x = 'mpg')) +\
  geom_histogram(fill = 'red', bins = 10)

Facet Plots

The facet wrapping function in ggplot2 can create fantastic visualisations when using larger datasets. A simple example is given in both implementations.

In R's ggplot2, quarter mile time (qsec) is plotted against horsepower (hp) for each number of cylinders category. The facet wrapping splits the data by the specified discrete variable, in this case cyl, and plots the qsec/hp relationship for each one.

ggplot(mtcars, aes(x = hp, y = qsec)) +
  geom_point() +
  facet_wrap(~factor(cyl))

And in Python's ggplot (note the same dummy variables for cyl are used), a similar output is seen. The slight difference is the absence of the grey border along the top of each plot in ggplot.

ggplot(mtcars, aes(x = 'hp', y = 'qsec')) +\
  geom_point() +\
  facet_wrap(~'cyl')

Making Things a Little Fancier

    The previous examples are very simple, but fundamental, plots in data science. With ggplot2 in R, one can be highly creative with their data visualisations by representing categories by colour, facet wrapping, etc. to create plots which hold a lot of information. The following plots are quick examples of how one can be more creative using both packages.

In ggplot2, mpg is plotted against hp with each data instance now coloured according to the number of cylinders.

    ggplot(mtcars , aes(x = hp , y = mpg , colour = factor(cyl))) + geom_point()

In Python’s ggplot, a similar approach is used to colour each data instance by model of vehicle.

    ggplot(mtcars , aes(x = 'hp' , y = 'mpg' , color = 'name')) +\
        geom_point()

Using another built-in dataset (diamonds), we can look at the distribution of diamond prices by cut; fittingly, colour is used to represent diamond colour.

    ggplot(diamonds , aes(x = price , fill = color)) + geom_histogram(colour = "black") + facet_wrap(~cut)

Python’s ggplot can produce a similar type of visualisation, which in my opinion is less slick than the ggplot2 version.

    ggplot(diamonds , aes(x = 'price' , fill = 'color')) +\
        geom_histogram(colour = 'black') +\
        facet_wrap('cut')

Conclusion

    That concludes this brief comparison of ggplot2 and ggplot. It is by no means exhaustive and I’m sure there are many ways of modifying Python plots in ggplot which I am unaware of for now. However, I am very grateful to the ggplot package creator Greg Lamp for allowing R fans to create ggplot2 style plots in Python and look forward to using the package in my Python endeavours.

To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics.


    R Quick Tip: parameter re-use within rmarkdown YAML

    Mon, 2017-05-08 04:32

    (This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

Ever wondered how to make an rmarkdown title dynamic? Maybe you wanted to use a parameter in multiple locations, or to pass through a publication date? Advanced use of YAML headers can help!

    Normally, when we write rmarkdown, we might use something like the basic YAML header that the rmarkdown template gives us.

---
title: "My report"
date: "18th April, 2017"
output: pdf_document
---

You may already know the trick for making the date dynamic – set it to whatever date the report gets rendered on – by using rmarkdown's inline R execution to insert a value.

---
title: "My report"
date: "`r Sys.Date()`"
output: pdf_document
---

What you may not already know is that YAML fields get evaluated sequentially, so a value created further up, in the params section, can be used later in the block.

---
params:
  dynamictitle: "My report"
  reportdate: !r Sys.Date()
output: pdf_document
title: "`r params$dynamictitle`"
date: "`r params$reportdate`"
---

    By doing this, you can then pass parameter values into the render() function when you want to generate a report.

    rmarkdown::render("MWE.Rmd", params=list(dynamictitle="New", reportdate=Sys.Date()-1) )

    Parameter re-use within rmarkdown enables you to dynamically generate vital metadata for the report and use values in multiple places within the document. Very handy!
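
As an extra illustration (my own sketch, not from the original post), the same params can also be re-used inside the body of the document with inline R, assuming the YAML header above:

This report, "`r params$dynamictitle`", was rendered on `r params$reportdate`.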

    The post R Quick Tip: parameter re-use within rmarkdown YAML appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.


    Shiny Application Layouts Exercises (Part-5)

    Sun, 2017-05-07 12:44

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Shiny Application Layouts-Vertical Layout

    In the fifth part of our series we will apply the kmeans() function to the iris dataset to create a shiny application. The difference is that now we will display its result vertically.

    This part can be useful for you in two ways.

    First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can revise what you learnt in the “Building Shiny App” series, as we will build basic shiny stuff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

    Answers to the exercises are available here.

    Vertical Layout

Every element passed to verticalLayout() will occupy a full row. The difference from fluidRow() is that verticalLayout() does not expect columns to be nested inside it. Look at the example below:
    #ui.R
    fluidPage(
    verticalLayout(
    titlePanel("example"),
    plotOutput("plot"),
    wellPanel(
    sliderInput("p", "Points", 10, 200,
    value = 50, step = 10)
    )
    )
    )

    Exercise 1

Create the initial UI. HINT: Use fluidPage().

    Exercise 2

    Apply the vertical layout to your UI. HINT: Use verticalLayout().

    Exercise 3

    Give the title “Vertical Iris” to your app. HINT: Use titlePanel().

    Exercise 4

    Create the ui side of your plot. HINT: Use plotOutput.

    You can use the wellPanel() function which will help you keep your elements separated. Now you should create a selectInput() with the X variables of the iris dataset.
    #ui.R
    wellPanel(selectInput('x', 'X Variable', names(iris)))

    Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

    Exercise 5

    Put a selectInput() to choose the X variable under your plot. HINT: Use selectInput() and wellPanel().

In the next two rows we will add, in the same way, a selectInput() and a numericInput() for the Y Variable and the number of clusters respectively, like the example below.
    #ui.R
    selectInput('y', 'Y Variable', names(iris),
    selected=names(iris)[[2]]),
    numericInput('clusters', 'ClusterS', 3,
    min = 1, max = 9)

    Exercise 6

    Create a selectInput() for the Y Variable and a numericInput() to decide the number of clusters and separate them. The features of both are up to you.

    Now we have to combine the selected variables for x and y into a new data frame like the example below:
#server.R
    comb <- reactive({
    data[, c(input$x, input$y)]
    })

    Exercise 7

    Combine the selected variables into a new data frame. HINT: Use reactive({}).

In order to apply the kmeans() function to the new data frame, with the chosen number of clusters, we have to use reactivity again. Here is an example:
#server.R
    clusters <- reactive({
    kmeans(comb(), input$clusters)
    })

    Exercise 8

    Apply the kmeans() function to Data() and use as input the number of clusters. HINT: Use reactive({}).

    Exercise 9

    Plot Data() to activate your plot on the server side. HINT: Use renderPlot().

    To create a completed kmeans plot you have to add cluster centers and separate each team by color. You can do this with this part of code.
    col = clusters()$cluster,
    pch = 20, cex = 3)
    points(clusters()$centers, pch = 4, cex = 4, lwd = 4)

    Exercise 10

    Add the code above in the correct place of your server side to complete your kmeans plot.
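
For reference, one possible shape of the finished server side, assembled from the example snippets above (a sketch, not the official solution set; it assumes the reactives are named comb() and clusters() as in the examples, and that the plot output id is "plot"):

#server.R (sketch)
function(input, output) {
  comb <- reactive({
    iris[, c(input$x, input$y)]
  })
  clusters <- reactive({
    kmeans(comb(), input$clusters)
  })
  output$plot <- renderPlot({
    # scatter the two selected variables, coloured by cluster assignment
    plot(comb(),
         col = clusters()$cluster,
         pch = 20, cex = 3)
    # mark the cluster centers
    points(clusters()$centers, pch = 4, cex = 4, lwd = 4)
  })
}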

    Related exercise sets:
    1. Building Shiny App exercises part 7
    2. Building Shiny App exercises part 4
    3. Building Shiny App exercises part 6
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.


    Know your data structures!

    Sun, 2017-05-07 12:43

    (This article was first published on r-bloggers – verenahaunschmid, and kindly contributed to R-bloggers)

    Just a few days ago I stated the following on Twitter:

    Since my tweet has been liked and shared a lot, I thought I’d backup my claim with a small experiment.

    I’ve used my 6 year old Macbook Pro, so if you try this on your computer everything is probably faster!

    Background

    The problem I tried to solve was to implement a genetic algorithm initially programmed in python by Randy Olson. I’m participating in a Data Science Learning Club and the current activity was about Genetic Algorithms. The jupyter notebook by Randy was given as an example and we can always choose by ourselves how we want to approach a topic in the learning club. Since I’m currently not doing much in R, I decided it would be a good practice to stay up-to-date.

    Here is the complete code for the assignment in my git repo.

    Porting the code

There will be a lot more detail about the genetic algorithm code in a future blog post. Basically, porting it was straightforward. I had to take care of the indices (R starts counting at 1, Python at 0) and for me, Python's indexing is not very intuitive. But I managed to create a running program. Randy's settings were to use 5000 generations and 100 populations. His Python code only took a couple of minutes on my computer.

    Running time

    When I first wanted to run my R code I immediately noticed something was wrong. It needed approximately 7 seconds for one generation, which would result in a complete running time of 9 hours! I stopped the program and tried to investigate what was going on.

    I debugged the code and found which part of the function run_genetic_algorithm was the slowest:

for (generation in 1:generations) {
  population_fitness <- list()
  for (i in 1:length(population)) {
    if (!contains(population_fitness, population[[i]])) {
      fit <- compute_fitness_fast(distance_matrix, population[[i]]) # here I had used the method compute_fitness instead of compute_fitness_fast
      agent_fitness <- list("agent"=population[[i]], "fitness"=unname(fit))
      population_fitness[[length(population_fitness)+1]] <- agent_fitness
    }
  }

    This part accounted for more than 5 seconds of the 7 seconds per loop!

    At first I thought the loop was the problem because as we all know you should avoid loops in R and use apply, lapply and so on ...

    So as a first step I changed the implementation of compute_fitness.

    This is the old implementation:

compute_fitness <- function(data_store, solution) {
  solution_fitness <- 0.0
  for (index in 2:length(solution)) {
    waypoint1 <- solution[index - 1]
    waypoint2 <- solution[index]
    solution_fitness <- solution_fitness + waypoint_distances(data_store, waypoint1, waypoint2)
  }
  return(solution_fitness)
}

    This was the first improvement using lapply:

compute_fitness_fast <- function(distance_matrix, solution) {
  fitness_list <- lapply(2:length(solution), function(index) {
    waypoint1 <- solution[index - 1]
    waypoint2 <- solution[index]
    waypoint_distances(data_store, waypoint1, waypoint2)
  })
  return(Reduce("+", fitness_list))
}

    This made it slightly better but only cut about half a second from the 5 seconds. Overall this would not reduce the running time by very much. Also notice the use of the method Reduce. While researching on how to improve my code, I also found out about Common Higher-Order Functions in Functional Programming Languages, which I didn't know existed in R (you might have heard about map/reduce operations before). That's very practical!
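
For readers who have not met these before, a minimal base-R illustration (my own toy example, not part of the original code):

Reduce("+", list(1, 2, 3))                # 6: fold a list with a binary function
Map(function(a, b) a + b, 1:3, 4:6)       # list(5, 7, 9): apply a function pairwise
Filter(function(x) x > 1, c(0, 1, 2, 3))  # 2 3: keep elements passing a predicate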

    Don't always use data.frames

Luckily I had the idea that maybe the data.frame where I stored the distances was the problem, since I had to do thousands of lookups. I researched a little and found out that matrices have better memory performance. Also the typical structure for storing pairwise distances is a distance matrix... I'm not sure if that's the whole story and what else is going on when I look up something in a data.frame, but I thought it was worth a try.
    I wrote a small script to transform the data.frame into a matrix:

build_distance_matrix <- function(data_store) {
  all_waypoints <- union(data_store$waypoint1, data_store$waypoint2)
  n <- length(all_waypoints)
  dist_m <- matrix(rep(0, n*n), nrow=n)
  for (i in 1:n) {
    for (j in 1:n) {
      if (i == j) {
        dist_m[i, j] <- 0
      } else if (i < j) {
        dist_ij <- unlist(waypoint_distances(data_store, all_waypoints[i], all_waypoints[j]))
        dist_m[i, j] <- dist_ij
        dist_m[j, i] <- dist_ij
      }
    }
  }
  colnames(dist_m) <- rownames(dist_m) <- all_waypoints
  return(dist_m)
}

    Since I use the city names as column and row names, I can directly use them to access the distance!
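
As a tiny illustration of that kind of lookup (toy city names, not the actual waypoints):

dist_m <- matrix(c(0, 12, 12, 0), nrow = 2,
                 dimnames = list(c("CityA", "CityB"), c("CityA", "CityB")))
dist_m["CityA", "CityB"]  # 12 -- direct lookup by row and column name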

    And I changed the method compute_fitness_fast to this:

compute_fitness_fast <- function(distance_matrix, solution) {
  fitness_list <- lapply(2:length(solution), function(index) {
    waypoint1 <- solution[index - 1]
    waypoint2 <- solution[index]
    distance_matrix[waypoint1, waypoint2]
  })
  return(Reduce("+", fitness_list))
}

    And yay, I immediately noticed a huge speedup from 5.619 seconds to 0.042 seconds per generation, which is 133 times faster!


My previous data structure: a data.frame of pairwise distances.
My new, faster data structure: a distance matrix.

microbenchmark

    To do the comparison properly, I used the R package microbenchmark (link to documentation). You can pass one or more functions/expressions to the microbenchmark method and tell it how often to run the functions for the comparison.

res <- microbenchmark(
  matrix = ({
    population_fitness <- list()
    for (agent_genome in population) {
      if (!contains(population_fitness, agent_genome)) {
        fit <- compute_fitness_fast(dist_matrix, agent_genome)
        agent_fitness <- list("agent"=agent_genome, "fitness"=unname(fit))
        population_fitness[[length(population_fitness)+1]] <- agent_fitness
      }
    }}),
  df = ({
    population_fitness <- list()
    for (agent_genome in population) {
      if (!contains(population_fitness, agent_genome)) {
        fit <- compute_fitness(data_store_us, agent_genome)
        agent_fitness <- list("agent"=agent_genome, "fitness"=unname(fit))
        population_fitness[[length(population_fitness)+1]] <- agent_fitness
      }
    }}),
  times = 100)

    If you want to plot the results using autoplot it is important to give the functions names (e.g. matrix and df) otherwise it will try to put the whole expressions onto the y axis labels.
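
A minimal sketch of that plotting step (assuming the res object from above and that ggplot2 is installed):

library(ggplot2)
autoplot(res)  # plots the timing distribution of each named expression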


    Pretty clear results!

    And as you can see, just by replacing the data.frame by the more suitable matrix, this important part of my algorithm became a lot faster!

    >> Using the microbenchmark package to compare the execution time of R expressions | gettinggeneticsdone.com

    Takeaway from this post

What I learned from this problem, and maybe you did too, is that you should never just use the data structure that is given. I simply read the csv file that was stored by the Jupyter notebook (it seems to be very efficient in Python) and used the first data structure that seemed to fit. In this case it would have made sense to think about what the underlying structure of the data is. There is one number for each pair of cities, which makes a matrix a really good fit. The name distance matrix already suggests how it should be stored.

I'd also like to know more about the underlying reason why there is such a huge difference. If you have any more resources, I'd like to read/hear about it!

    The post Know your data structures! appeared first on verenahaunschmid.

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – verenahaunschmid.


    Plot the Vote: Making U.S. Senate & House Cartograms in R

    Sun, 2017-05-07 12:37

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    Political machinations are a tad insane in the U.S. these days & I regularly hit up @ProPublica & @GovTrack sites (& sub to the GovTrack e-mail updates) as I try to be an informed citizen, especially since I’ve got a Senator and Representative who seem to be in the sway of


    RInside 0.2.14

    Sun, 2017-05-07 11:43

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A new release 0.2.14 of RInside is now on CRAN and in Debian.

    RInside provides a set of convenience classes which facilitate embedding of R inside of C++ applications and programs, using the classes and functions provided by Rcpp.

    It has been nearly two years since the last release, and a number of nice extensions, build robustifications and fixes had been submitted over this period—see below for more. The only larger and visible extension is both a new example and some corresponding internal changes to allow a readline prompt in an RInside application, should you desire it.

RInside is stressing the CRAN system a little in that it triggers a number of NOTE and WARNING messages. Some of these are par for the course as we get close to R internals, not all of which are "officially" in the API. This led to the submission sitting a little longer than usual in the incoming queue. Going forward we may need to find a way to either sanction these access points, whitelist them, or, as a last resort, take the package off CRAN. Time will tell.

    Changes since the last release were:

    Changes in RInside version 0.2.14 (2017-04-28)
    • Interactive mode can use readline REPL (Łukasz Łaniewski-Wołłk in #25, and Dirk in #26)

    • Windows macros checks now uses _WIN32 (Kevin Ushey in #22)

    • The wt example now links with libboost_system

• The Makevars file is now more robust (Mattias Ellert in #21)

    • A problem with empty environment variable definitions on Windows was addressed (Jeroen Ooms in #17 addressing #16)

    • HAVE_UINTPTR_T is defined only if not already defined

    • Travis CI is now driven via run.sh from our forked r-travis

    CRANberries also provides a short report with changes from the previous release. More information is on the RInside page. Questions, comments etc should go to the rcpp-devel mailing list off the Rcpp R-Forge page, or to issues tickets at the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


    dplyr in Context

    Sat, 2017-05-06 22:05

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Introduction

Beginning R users often come to the false impression that the popular packages dplyr and tidyr are both all of R and sui generis inventions (in that they might be unprecedented and there might be no other reasonable way to get the same effects in R). These packages and their conventions are high-value, but they are results of evolution and implement a style of programming that has been available in R for some time. They evolved in a context, and did not burst on the scene fully armored with spear in hand.

    Zeus giving birth to Athena, Rudolph Tegner

    dplyr and tidyr

    We will start with a (very) brief outline of the primary capabilities of dplyr and tidyr.

    dplyr

    dplyr embodies the idea that data manipulation should be broken down into a sequence of transformations.

    For example: in R if one wishes to add a column to a data.frame it is common to perform an "in-place" calculation as shown below:

d <- data.frame(x=c(-1,0,1))
print(d)
##    x
## 1 -1
## 2  0
## 3  1

d$absx <- abs(d$x)
print(d)
##    x absx
## 1 -1    1
## 2  0    0
## 3  1    1

    This has a couple of disadvantages:

    • The original d has been altered, so re-starting calculations (say after we discover a mistake) can be inconvenient.
    • We have to keep repeating the name of the data.frame which is not only verbose (which is not that important an issue), it is a chance to write the wrong name and introduce an error.

    The "dplyr-style" is to write the same code as follows:

suppressPackageStartupMessages(library("dplyr"))

d <- data.frame(x=c(-1,0,1))
d %>% mutate(absx = abs(x))
##    x absx
## 1 -1    1
## 2  0    0
## 3  1    1

# confirm our original data frame is unaltered
print(d)
##    x
## 1 -1
## 2  0
## 3  1

    The idea is to break your task into the sequential application of a small number of "standard verbs" to produce your result. The verbs are "pipelined" or sequenced using the magrittr pipe "%>%" which can be thought of as if the following four statements were to be taken as equivalent:

    • f(x)
    • x %>% f(.)
    • x %>% f()
    • x %>% f

    This lets one write a sequence of operations as a left to right pipeline (without explicit nesting of functions or use of numerous intermediate variables). Some discussion can be found here.
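
A trivial illustration of those equivalences (my own example, not from the original post):

library(magrittr)
x <- c(4, 9, 16)
sqrt(x)        # 2 3 4
x %>% sqrt(.)  # same result
x %>% sqrt()   # same result
x %>% sqrt     # same result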

    Primary dplyr verbs include the "single table verbs" from the dplyr 0.5.0 introduction vignette:

    • filter() (and slice())
    • arrange()
    • select() (and rename())
    • distinct()
    • mutate() (and transmute())
    • summarise()
    • sample_n() (and sample_frac())

These have high-performance implementations (often in C++ thanks to Rcpp) and often have defaults that are safer and better for programming (not changing types on single column data-frames, not promoting strings to factors, and so on). Not really discussed in the dplyr 0.5.0 introduction are the dplyr::*join() operators, which are in fact critical components, but easily explained as standard relational joins (i.e., they are very important implementations, but not novel concepts).
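
As a small sketch of such a join (toy data, not from the vignette):

library(dplyr)
band  <- data.frame(name = c("Mick", "John", "Paul"),
                    band = c("Stones", "Beatles", "Beatles"),
                    stringsAsFactors = FALSE)
plays <- data.frame(name = c("John", "Paul", "Keith"),
                    plays = c("guitar", "bass", "guitar"),
                    stringsAsFactors = FALSE)
left_join(band, plays, by = "name")  # keeps all rows of band; Mick gets NA for plays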

    Fairly complex data transforms can be broken down in terms of these verbs (plus some verbs from tidyr):

    Take for example a slightly extended version of one of the complex work-flows from dplyr 0.5.0 introduction vignette.

The goal is: plot the distribution of average flight arrival delays and average flight departure delays (all averages grouped by date) for dates where either of these averages is at least 30 minutes. The first step is writing down the goal (as we did above). With that clear, someone familiar with dplyr can write a pipeline or work-flow as below (we have added the gather and arrange steps to extend the example a bit):

    library("nycflights13") suppressPackageStartupMessages(library("dplyr")) library("tidyr") library("ggplot2") summary1 <- flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) %>% gather(key = delayType, value = delayMinutes, arr, dep) %>% arrange(year, month, day, delayType) ## Adding missing grouping variables: `year`, `month`, `day` dim(summary1) ## [1] 98 5 head(summary1) ## Source: local data frame [6 x 5] ## Groups: year, month [2] ## ## year month day delayType delayMinutes ## ## 1 2013 1 16 arr 34.24736 ## 2 2013 1 16 dep 24.61287 ## 3 2013 1 31 arr 32.60285 ## 4 2013 1 31 dep 28.65836 ## 5 2013 2 11 arr 36.29009 ## 6 2013 2 11 dep 39.07360 ggplot(data= summary1, mapping=aes(x=delayMinutes, color=delayType)) + geom_density() + ggtitle(paste("distribution of mean arrival and departure delays by date", "when either mean delay is at least 30 minutes", sep='\n'), subtitle = "produced by: dplyr/magrittr/tidyr packages")

Once you get used to the notation (become familiar with "%>%" and the verbs) the above can be read in small pieces and is considered fairly elegant. The warning message indicates it would have been better documentation to write the initial select() as "select(year, month, day, arr_delay, dep_delay)" (in addition I feel that group_by() should always be written as close to summarise() as is practical). We have intentionally (beyond minor extension) kept the example as is.

But dplyr is not unprecedented. It was preceded by the plyr package, and many of these transformational verbs actually have near equivalents in the base R namespace (a short base-R sketch follows the list):

    • dplyr::filter() ~ base::subset()
    • dplyr::arrange() ~ base::order()
    • dplyr::select() ~ base::[]
    • dplyr::mutate() ~ base::transform()
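
A rough sketch of those correspondences on mtcars (my own illustration, not from the original post):

subset(mtcars, cyl == 4)                   # ~ dplyr::filter(mtcars, cyl == 4)
mtcars[order(mtcars$mpg), , drop = FALSE]  # ~ dplyr::arrange(mtcars, mpg)
mtcars[, c("mpg", "cyl")]                  # ~ dplyr::select(mtcars, mpg, cyl)
transform(mtcars, kpl = 0.425 * mpg)       # ~ dplyr::mutate(mtcars, kpl = 0.425 * mpg)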

    We will get back to these substitutions after we discuss tidyr.

    tidyr

    tidyr is a smaller package than dplyr and it mostly supplies the following verbs:

• complete() (a bulk coalesce function)
• gather() (an un-pivot operation, related to stats::reshape())
    • spread() (a pivot operation, related to stats::reshape())
    • nest() (a hierarchical data operation)
    • unnest() (opposite of nest(), closest analogy might be base::unlist())
    • separate() (split a column into multiple columns)
    • extract() (extract one column)
    • expand() (complete an experimental design)

    The most famous tidyr verbs are nest(), unnest(), gather(), and spread(). We will discuss gather() here as it and spread() are incremental improvements on stats::reshape().

    Note also the tidyr package was itself preceded by a package called reshape2, which supplied pivot capabilities in terms of verbs called melt() and dcast().
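
A minimal sketch of the pivot/un-pivot pair on toy data (my own example, not from the post):

library(tidyr)
wide <- data.frame(id = 1:2, arr = c(34.2, 32.6), dep = c(24.6, 28.7))
long <- gather(wide, key = delayType, value = delayMinutes, arr, dep)  # un-pivot
spread(long, key = delayType, value = delayMinutes)                    # pivot back to wide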

    The flights example again

It may come as a shock to some, but one can roughly "line for line" translate the "nycflights13" example from the dplyr 0.5.0 introduction into common methods from base:: and stats:: that reproduce the sequence-of-transforms style. I.e., transformational style is already available in "base-R".

    By "base-R" we mean R with only its standard name-spaces (base, util, stats and a few others). Or "R out of the box" (before loading many packages). "base-R" is not meant as a pejorative term here. We don’t take "base-R" to in any way mean "old-R", but to denote the core of the language we have decided to use for many analytic tasks.

What we are doing is separating the style of programming taught "as dplyr" (itself a significant contribution) from the implementation (also a significant contribution). We will replace the use of the magrittr pipe "%>%" with the Bizarro Pipe (an effect available in base-R) to produce code that works without use of dplyr, tidyr, or magrittr.
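
The Bizarro Pipe is nothing more than right-assignment into a variable named "."; a toy example (mine, not Win-Vector's):

c(1, 4, 9) ->.; sqrt(.) ->.; sum(.) -> total
print(total)
## [1] 6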

    The translated example:

    library("nycflights13") library("ggplot2") flights ->.; # select columns we are working with .[c('arr_delay', 'dep_delay', 'year', 'month', 'day')] ->.; # simulate the group_by/summarize by split/lapply/rbind transform(., key=paste(year, month, day)) ->.; split(., .$key) ->.; lapply(., function(.) { transform(., arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) )[1, , drop=FALSE] }) ->.; do.call(rbind, .) ->.; # filter to either delay at least 30 minutes subset(., arr > 30 | dep > 30) ->.; # select only columns we wish to present .[c('year', 'month', 'day', 'arr', 'dep')] ->.; # get the data into a long form # can't easily use stack as (from help(stack)): # "stack produces a data frame with two columns"" reshape(., idvar = c('year','month','day'), direction = 'long', varying = c('arr', 'dep'), timevar = 'delayType', v.names = 'delayMinutes') ->.; # convert reshape ordinals back to original names transform(., delayType = c('arr', 'dep')[delayType]) ->.; # make sure the data is in the order we expect .[order(.$year, .$month, .$day, .$delayType), , drop=FALSE] -> summary2 # clean out the row names for clarity of presentation rownames(summary2) <- NULL dim(summary2) ## [1] 98 5 head(summary2) ## year month day delayType delayMinutes ## 1 2013 1 16 arr 34.24736 ## 2 2013 1 16 dep 24.61287 ## 3 2013 1 31 arr 32.60285 ## 4 2013 1 31 dep 28.65836 ## 5 2013 2 11 arr 36.29009 ## 6 2013 2 11 dep 39.07360 ggplot(data= summary2, mapping=aes(x=delayMinutes, color=delayType)) + geom_density() + ggtitle(paste("distribution of mean arrival and departure delays by date", "when either mean delay is at least 30 minutes", sep='\n'), subtitle = "produced by: base/stats packages plus Bizarro Pipe") print(all.equal(as.data.frame(summary1),summary2)) ## [1] TRUE

    The above work-flow is a bit rough, but the simple introduction of a few light-weight wrapper functions would clean up the code immensely.

    The ugliest bit is the by-hand replacement of the group_by()/summarize() pair, so that would be a good candidate to wrap in a function (either full split/apply/combine style or some specialization such as grouped ordered apply).

    The reshape step is also a bit rough, but I like the explicit specification of idvars (without these the person reading the code has little idea what the structure of the intended transform is). This is why even though I prefer the tidyr::gather() implementation to stats::reshape() I chose to wrap tidyr::gather() into a more teachable "coordinatized data" signature (the idea is: explicit grouping columns were a good idea for summarize(), and they are also a good idea for pivot/un-pivot).

Also, the use of expressions such as ".$year" is probably not a bad thing; dplyr itself is introducing "data pronouns" to try and reduce ambiguity and would write some of these expressions as ".data$year". In fact the dplyr authors consider notations such as "mtcars %>% select(.data["disp"])" as recommended notation (though at this point one is just wrapping the base-R version "mtcars ->.; .[["disp"]]" in a needless "select()").

    Conclusion

    R itself is very powerful. That is why additional powerful notations and powerful conventions can be built on top of R. R also, for all its warts, has always been a platform for statistics and analytics. So: for common data manipulation tasks you should expect R does in fact have some ready-made tools.

    It is often said "R is its packages", but I think that is missing how much R packages owe back to design decisions found in "base-R".

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


    Evolution of ice hockey players’ height: IIHF world championships 2001-2016

    Sat, 2017-05-06 20:00

    (This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

The 2017 Ice Hockey World Championship has started. Thus I want to share a small study of the height of ice hockey players that I did almost a year ago and published in Russian.

    When the TV camera shows the players returning to the changing rooms, it is difficult not to notice just how huge the players are compared to the surrounding people – fans, journalists, coaches, or the ice arena workers. For example, here are the rising stars of the Finnish hockey – Patrik Laine and Aleksander Barkov – with the two fans in between.


    Source

    So the questions arise. Are ice hockey players really taller than average people? How is the height of ice hockey players evolving over time? Are there any lasting differences between countries?

    Data

IIHF, the organization in charge of the ice hockey world championships, publishes detailed information on the squads, including data on each player’s height and weight. The raw data files are here. I gathered the data on all players that participated in the 16 world championships between 2001 and 2016. The formatting of the data files changes from year to year, complicating the data processing, so I did the data cleaning manually, which took a bit more than 3 hours. The unified dataset is here. Let’s load the data and prepare the R session.

# load required packages
library(tidyverse) # data manipulation and viz
library(lubridate) # easy manipulations with dates
library(ggthemes)  # themes for ggplot2
library(texreg)    # easy export of regression tables
library(xtable)    # export a data frame into an html table
library(sysfonts)  # change the font in figures

# download the IIHF data set; if there are some problems, you can download manually
# using the stable URL (https://dx.doi.org/10.6084/m9.figshare.3394735.v2)
df <- read.csv("https://ndownloader.figshare.com/files/5303173")

# color palette
brbg11 <- RColorBrewer::brewer.pal(11, "BrBG")

Do the players become taller? (a crude comparison)

Let’s first have a look at the pooled average height of all the players that participated.

# mean height by championship
df_per <- df %>% group_by(year) %>% summarise(height = mean(height))

gg_period_mean <- ggplot(df_per, aes(x = year, y = height))+
  geom_point(size = 3, color = brbg11[9])+
  stat_smooth(method = "lm", size = 1, color = brbg11[11])+
  ylab("height, cm")+
  xlab("year of competition")+
  scale_x_continuous(breaks = seq(2005, 2015, 5), labels = seq(2005, 2015, 5))+
  theme_few(base_size = 15, base_family = "mono")+
  theme(panel.grid = element_line(colour = "grey75", size = .25))

gg_period_jitter <- ggplot(df, aes(x = year, y = height))+
  geom_jitter(size = 2, color = brbg11[9], alpha = .25, width = .75)+
  stat_smooth(method = "lm", size = 1, se = F, color = brbg11[11])+
  ylab("height, cm")+
  xlab("year of competition")+
  scale_x_continuous(breaks = seq(2005, 2015, 5), labels = seq(2005, 2015, 5))+
  theme_few(base_size = 15, base_family = "mono")+
  theme(panel.grid = element_line(colour = "grey75", size = .25))

gg_period <- cowplot::plot_grid(gg_period_mean, gg_period_jitter)

    Figure 1. The dynamics of the average height of the ice hockey players at the world championships, 2001–2016

    The positive trend is evident. In the 15 years the average height of a player increased by almost 2 cm (left panel). Is that a lot? To have an idea, we will compare this growth to the dynamics in the population, later in the post.

    Cohort approach

A more correct way to study the dynamics of players’ height is to do the comparison between birth cohorts. Here we face an interesting data preparation issue – some of the players participated in more than one championship. The question is: do we need to clean the duplicate records? If the goal is to see the average height of a player at a certain championship (as in Figure 1), it is reasonable to keep all the records. Alternatively, if the aim is to analyze the dynamics of players’ height itself, I argue, it would be wrong to assign a bigger weight to those players that participated in more than one championship. Thus, for the further cohort analysis, I cleaned the dataset of the duplicates.

dfu_h <- df %>%
  select(year, name, country, position, birth, cohort, height) %>%
  spread(year, height)
dfu_h$av.height <- apply(dfu_h[, 6:21], 1, mean, na.rm = T)
dfu_h$times_participated <- apply(!is.na(dfu_h[, 6:21]), 1, sum)

dfu_w <- df %>%
  select(year, name, country, position, birth, cohort, weight) %>%
  spread(year, weight)
dfu_w$av.weight <- apply(dfu_w[, 6:21], 1, mean, na.rm = T)

dfu <- left_join(dfu_h %>% select(name, country, position, birth, cohort,
                                  av.height, times_participated),
                 dfu_w %>% select(name, country, position, birth, cohort, av.weight),
                 by = c("name", "country", "position", "birth", "cohort")) %>%
  mutate(bmi = av.weight / (av.height / 100) ^ 2)

The total number of observations decreased from 6292 to 3333. For those who participated in more than one championship, I averaged the data on height and weight as they can change during the life-course. How many times, on average, are ice hockey players honored to represent their countries in the world championships? A bit less than 2.

# frequencies of participation in world championships
mean(dfu$times_participated)

df_part <- as.data.frame(table(dfu$times_participated))

gg_times_part <- ggplot(df_part, aes(y = Freq, x = Var1))+
  geom_bar(stat = "identity", fill = brbg11[8])+
  ylab("# of players")+
  xlab("times participated (out of 16 possible)")+
  theme_few(base_size = 15, base_family = "mono")+
  theme(panel.grid = element_line(colour = "grey75", size = .25))

    Figure 2. Histogram of the players by the number of times they participated in world championships over the period 2001-2016.

    But there are unique players that participated in a considerable number of championships. Let’s have a look at those who participated at least 10 times out of 16 possible. There were just 14 such players.

# the leaders of participation in world championships
leaders <- dfu %>% filter(times_participated > 9)
View(leaders)

# save the table to html
print(xtable(leaders), type = "html", file = "table_leaders.html")

Table 1. The players who participated most frequently

name                 country  position  birth date   cohort  av.height  times_participated  av.weight  bmi
ovechkin alexander   RUS      F         1985-09-17   1985    188.45     11                  98.36      27.70
nielsen daniel       DEN      D         1980-10-31   1980    182.27     11                  79.73      24.00
staal kim            DEN      F         1978-03-10   1978    182.00     10                  87.80      26.51
green morten         DEN      F         1981-03-19   1981    183.00     12                  85.83      25.63
masalskis edgars     LAT      G         1980-03-31   1980    176.00     12                  79.17      25.56
ambuhl andres        SUI      F         1983-09-14   1983    176.80     10                  83.70      26.78
granak dominik       SVK      D         1983-06-11   1983    182.00     10                  79.50      24.00
madsen morten        DEN      F         1987-01-16   1987    189.82     11                  86.00      23.87
redlihs mikelis      LAT      F         1984-07-01   1984    180.00     10                  80.40      24.81
cipulis martins      LAT      F         1980-11-29   1980    180.70     10                  82.10      25.14
holos jonas          NOR      D         1987-08-27   1987    180.18     11                  91.36      28.14
bastiansen anders    NOR      F         1980-10-31   1980    190.00     11                  93.64      25.94
ask morten           NOR      F         1980-05-14   1980    185.00     10                  88.30      25.80
forsberg kristian    NOR      F         1986-05-05   1986    184.50     10                  87.50      25.70

Alexander Ovechkin – 11 times! But it has to be noted that not every player had the possibility to participate in all 16 championships between 2001 and 2016. That depends on a number of factors:

    • the birth cohort of the player;
    • whether his national team regularly qualified for the championship (Figure 3);
    • whether the player was good enough for the national team;
    • whether he was free from the NHL play-offs that often keep the best players off the world championships.
# countries times participated
df_cnt_part <- df %>%
  select(year, country, no) %>%
  mutate(country = factor(paste(country))) %>%
  group_by(country, year) %>%
  summarise(value = sum(as.numeric(no))) %>%
  mutate(value = 1) %>%
  ungroup() %>%
  mutate(country = factor(country, levels = rev(levels(country))),
         year = factor(year))

d_cnt_n <- df_cnt_part %>%
  group_by(country) %>%
  summarise(n = sum(value))

gg_cnt_part <- ggplot(data = df_cnt_part, aes(x = year, y = country))+
  geom_point(color = brbg11[11], size = 7)+
  geom_text(data = d_cnt_n, aes(y = country, x = 17.5, label = n, color = n),
            size = 7, fontface = 2)+
  geom_text(data = d_cnt_n, aes(y = country, x = 18.5, label = " "), size = 7)+
  scale_color_gradientn(colours = brbg11[7:11])+
  xlab(NULL)+
  ylab(NULL)+
  theme_bw(base_size = 25, base_family = "mono")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

    Figure 3. Stats of the national teams participation in the world championships

Do the ice hockey players become taller? (regression analysis)

The regression analysis allows us to address the research question – the association between player’s height and birth cohort – accounting for the cross-national differences and player’s position. I use OLS regressions, which are quite sensitive to outliers, so I removed the birth cohorts for which there are fewer than 10 players – 1963, 1997, and 1998.

# remove small cohorts
table(dfu$cohort)

dfuc <- dfu %>% filter(cohort < 1997, cohort > 1963)

    So, the results. I add the variables one by one.

    Dependent variable: player’s height.
    Explaining variables: 1) birth cohort; 2) position (compared to defenders); 3) country (compared to Russia).

# relevel country variable to compare with Russia
dfuc$country <- relevel(dfuc$country, ref = "RUS")

# regression models
m1 <- lm(data = dfuc, av.height~cohort)
m2 <- lm(data = dfuc, av.height~cohort+position)
m3 <- lm(data = dfuc, av.height~cohort+position+country)

# export the models to html
htmlreg(list(m1, m2, m3), file = "models_height.html", single.row = T)

    Table2. The models

                 Model 1            Model 2            Model 3
(Intercept)      -10.17 (27.67)     -18.64 (27.01)     32.59 (27.00)
cohort           0.10 (0.01)***     0.10 (0.01)***     0.08 (0.01)***
positionF                           -2.59 (0.20)***    -2.59 (0.20)***
positionG                           -1.96 (0.31)***    -1.93 (0.30)***
countryAUT                                             -0.94 (0.55)
countryBLR                                             -0.95 (0.53)
countryCAN                                             1.13 (0.46)*
countryCZE                                             0.56 (0.49)
countryDEN                                             -0.10 (0.56)
countryFIN                                             0.20 (0.50)
countryFRA                                             -2.19 (0.69)**
countryGER                                             -0.61 (0.51)
countryHUN                                             -0.61 (0.86)
countryITA                                             -3.58 (0.61)***
countryJPN                                             -5.24 (0.71)***
countryKAZ                                             -1.16 (0.57)*
countryLAT                                             -1.38 (0.55)*
countryNOR                                             -1.61 (0.62)**
countryPOL                                             0.06 (1.12)
countrySLO                                             -1.55 (0.58)**
countrySUI                                             -1.80 (0.53)***
countrySVK                                             1.44 (0.50)**
countrySWE                                             1.18 (0.48)*
countryUKR                                             -1.82 (0.59)**
countryUSA                                             0.54 (0.45)
R2               0.01               0.06               0.13
Adj. R2          0.01               0.06               0.12
Num. obs.        3319               3319               3319
RMSE             5.40               5.27               5.10

    Model 1. One year change in the birth cohort year is associated with an increase of 0.1 cm in height. The coefficient is statistically significant, yet the variable explains only 1% of the variance. That’s not a big problem since the aim of the modeling is to document the differences, rather than predict based on the model. Nevertheless, the low coefficient of determination means that there are other variables that explain the differences in players’ height better than just the birth cohort.

    Model 2. Defenders are the tallest ice hockey players: goalkeepers are 2 cm shorter, forwards are 2.6 cm shorter. All the coefficients are significant; R squared rose to 6%. It is worth noting that the coefficient for the birth cohort did not change when we added the new variable.

Model 3. It is interesting to control for countries for two reasons. First, some of the differences are significant themselves. For example, Swedes, Slovaks, and Canadians are taller than Russians. In contrast, Japanese are 5.2 cm shorter, Italians – 3.6 cm, French – 2.2 cm (figure 4). Second, once the country controls are introduced, the coefficient for birth cohort decreased slightly, meaning that some of the differences in height are explained by persisting cross-country differences. R squared rose to 13%.

# players' height by country
gg_av.h_country <- ggplot(dfuc , aes(x = factor(cohort), y = av.height))+
  geom_point(color = "grey50", alpha = .25)+
  stat_summary(aes(group = country), geom = "line", fun.y = mean, size = .5, color = "grey50")+
  stat_smooth(aes(group = country, color = country), geom = "line", size = 1)+
  facet_wrap(~country, ncol = 4)+
  coord_cartesian(ylim = c(170, 195))+
  scale_x_discrete(labels = paste(seq(1970, 1990, 10)), breaks = paste(seq(1970, 1990, 10)))+
  labs(x = "birth cohort", y = "height, cm")+
  theme_few(base_size = 15, base_family = "mono")+
  theme(legend.position = "none", panel.grid = element_line(colour = "grey75", size = .25))

    Figure 4. The height of ice hockey players by nations

The last model indicates that from one birth cohort to the next the height of ice hockey players increases by 0.08 cm. That means an increase of 0.8 cm in a decade, or a growth of 2.56 cm in the 32 years between 1964 and 1996. It is worth mentioning that once we run the analysis in cohorts and control for positions and nations, the speed of the players’ height increase becomes much humbler than in the crude pooled analysis (Figure 1): 0.8 cm per decade compared to 1.2 cm per decade.

    Before we go further and compare the growth in player’s height to that of the population, let’s do the modeling separately for defenders, goalkeepers, and forwards. The exploratory plot (Figure 5) suggests that the correlation is stronger for goalkeepers and weaker for defenders.

dfuc_pos <- dfuc
levels(dfuc_pos$position) <- c("Defenders", "Forwards", "Goalkeepers")

gg_pos <- ggplot(dfuc_pos , aes(x = cohort, y = av.height))+
  geom_jitter(aes(color = position), alpha = .5, size = 2)+
  stat_smooth(method = "lm", se = T, color = brbg11[11], size = 1)+
  scale_x_continuous(labels = seq(1970, 1990, 10), breaks = seq(1970, 1990, 10))+
  scale_color_manual(values = brbg11[c(8, 9, 10)])+
  facet_wrap(~position, ncol = 3)+
  xlab("birth cohort")+
  ylab("height, cm")+
  theme_few(base_size = 15, base_family = "mono")+
  theme(legend.position = "none", panel.grid = element_line(colour = "grey75", size = .25))

    Figure 5. Correlation between height and birth cohort by position

# separate models for positions
m3d <- lm(data = dfuc %>% filter(position == "D"), av.height~cohort+country)
m3f <- lm(data = dfuc %>% filter(position == "F"), av.height~cohort+country)
m3g <- lm(data = dfuc %>% filter(position == "G"), av.height~cohort+country)

htmlreg(list(m3d, m3f, m3g), file = "models_height_pos.html", single.row = T,
        custom.model.names = c("Model 3 D", "Model 3 F", "Model 3 G"))

    Table 3. Model 3 – separately for defenders, forwards, and goalkeepers

                 Model 3 D          Model 3 F          Model 3 G
(Intercept)      108.45 (46.46)*    49.32 (36.73)      -295.76 (74.61)***
cohort           0.04 (0.02)        0.07 (0.02)***     0.24 (0.04)***
countryAUT       0.14 (0.96)        -2.01 (0.75)**     0.47 (1.47)
countryBLR       0.30 (0.87)        -1.53 (0.73)*      -2.73 (1.55)
countryCAN       1.55 (0.78)*       0.39 (0.62)        3.45 (1.26)**
countryCZE       0.87 (0.84)        0.30 (0.67)        0.63 (1.36)
countryDEN       -0.60 (0.95)       0.10 (0.75)        -0.19 (1.62)
countryFIN       -0.55 (0.89)       -0.04 (0.67)       2.40 (1.32)
countryFRA       -3.34 (1.15)**     -2.06 (0.93)*      1.39 (2.07)
countryGER       0.48 (0.85)        -1.40 (0.72)       -0.65 (1.33)
countryHUN       -1.32 (1.47)       -0.70 (1.16)       0.65 (2.39)
countryITA       -2.08 (1.08)       -4.78 (0.82)***    -2.02 (1.62)
countryJPN       -4.13 (1.26)**     -6.52 (0.94)***    -2.27 (1.98)
countryKAZ       -1.23 (0.95)       -1.82 (0.79)*      1.79 (1.58)
countryLAT       -0.73 (0.95)       -1.39 (0.75)       -3.42 (1.49)*
countryNOR       -3.25 (1.07)**     -1.06 (0.85)       -0.10 (1.66)
countryPOL       0.82 (1.89)        -0.58 (1.55)       0.37 (2.97)
countrySLO       -1.57 (0.99)       -1.54 (0.79)       -2.25 (1.66)
countrySUI       -1.98 (0.91)*      -2.36 (0.71)***    1.12 (1.47)
countrySVK       2.94 (0.87)***     0.81 (0.67)        -0.70 (1.50)
countrySWE       0.75 (0.81)        1.24 (0.65)        1.37 (1.33)
countryUKR       -1.37 (1.01)       -1.77 (0.80)*      -3.71 (1.66)*
countryUSA       0.76 (0.78)        -0.08 (0.62)       2.58 (1.26)*
R2               0.09               0.10               0.24
Adj. R2          0.07               0.09               0.20
Num. obs.        1094               1824               401
RMSE             5.08               5.08               4.87

The separate modeling shows that the average height of ice hockey players who were born in 1964-1996 and participated in the world championships in 2001–2016 increased at a rate of 0.4 cm per decade for defenders, 0.7 cm – for forwards, and (!) 2.4 cm – for goalies. In three decades the average height of the goalkeepers increased by 7 cm!

    Finally, let’s compare these dynamics with those in the population.

    Compare to population

    Our previous results expose significant height differences between players of various nations. Thus, it is reasonable to compare ice hockey players’ height to the corresponding male population of their countries.

For the data on the height of males in the population of the corresponding nations I used the relevant scientific paper. I grabbed the data from the paper’s PDF using a nice little tool – tabula – and also deposited it on figshare.

# download the data from Hatton, T. J., & Bray, B. E. (2010).
# Long run trends in the heights of European men, 19th–20th centuries.
# Economics & Human Biology, 8(3), 405–413.
# http://doi.org/10.1016/j.ehb.2010.03.001
# stable URL, copied data (https://dx.doi.org/10.6084/m9.figshare.3394795.v1)
df_hb <- read.csv("https://ndownloader.figshare.com/files/5303878")

df_hb <- df_hb %>%
  gather("country", "h_pop", 2:16) %>%
  mutate(period = paste(period)) %>%
  separate(period, c("t1", "t2"), sep = "/") %>%
  transmute(cohort = (as.numeric(t1)+as.numeric(t2))/2, country, h_pop)

# calculate hockey players' cohort height averages for each country
df_hoc <- dfu %>%
  group_by(country, cohort) %>%
  summarise(h_hp = mean(av.height)) %>%
  ungroup()

    Unfortunately, our dataset on ice hockey players intersects with the data on population only for 8 countries: Austria, Denmark, Finland, France, Germany, Italy, Norway, and Sweden.

# countries in both data sets
both_cnt <- levels(factor(df_hb$country))[which(levels(factor(df_hb$country)) %in% levels(df_hoc$country))]
both_cnt

gg_hoc_vs_pop <- ggplot()+
  geom_path(data = df_hb %>% filter(country %in% both_cnt),
            aes(x = cohort, y = h_pop), color = brbg11[9], size = 1)+
  geom_point(data = df_hb %>% filter(country %in% both_cnt),
             aes(x = cohort, y = h_pop), color = brbg11[9], size = 2)+
  geom_point(data = df_hb %>% filter(country %in% both_cnt),
             aes(x = cohort, y = h_pop), color = "white", size = 1.5)+
  geom_point(data = df_hoc %>% filter(country %in% both_cnt),
             aes(x = cohort, y = h_hp), color = brbg11[3], size = 2, pch = 18)+
  stat_smooth(data = df_hoc %>% filter(country %in% both_cnt),
              aes(x = cohort, y = h_hp), method = "lm", se = F, color = brbg11[1], size = 1)+
  facet_wrap(~country, ncol = 2)+
  labs(y = "height, cm", x = "birth cohort")+
  theme_few(base_size = 20, base_family = "mono")+
  theme(panel.grid = element_line(colour = "grey75", size = .25))

    Figure 6. The comparison of height dynamics in ice hockey players (brown) and the corresponding male populations (green)

In all the analyzed countries, ice hockey players are 2-5 cm taller than the nation’s average. This is not very surprising since we expect some selection in sport. What is more interesting, in the developed countries the rapid increase in the height of males mostly leveled off in the birth cohorts of the 1960s. Unlike the population trend, the height of ice hockey players continued to increase at roughly the same pace in all the analyzed countries except for Denmark.

For the cohorts of Europeans that were born in the first half of the 20th century, the height of males increased by 1.18–1.74 cm per decade (Figure 7, middle panel). Starting from the birth cohorts of the 1960s, the pace decreased to 0.15–0.80 cm per decade.

# growth in population
df_hb_w <- df_hb %>% spread(cohort, h_pop)
names(df_hb_w)[2:26] <- paste("y", names(df_hb_w)[2:26])
diffs <- df_hb_w[, 3:26]-df_hb_w[, 2:25]

df_hb_gr <- df_hb_w %>%
  transmute(country,
            gr_1961_1980 = unname(apply(diffs[, 22:24], 1, mean, na.rm = T))*2,
            gr_1901_1960 = unname(apply(diffs[, 9:21], 1, mean, na.rm = T))*2,
            gr_1856_1900 = unname(apply(diffs[, 1:8], 1, mean, na.rm = T))*2) %>%
  gather("period", "average_growth", 2:4) %>%
  filter(country %in% both_cnt) %>%
  mutate(country = factor(country, levels = rev(levels(factor(country)))),
         period = factor(period, labels = c("1856-1900", "1901-1960", "1961-1980")))

gg_hb_growth <- ggplot(df_hb_gr, aes(x = average_growth, y = country))+
  geom_point(aes(color = period), size = 3)+
  scale_color_manual(values = brbg11[c(8, 3, 10)])+
  scale_x_continuous(limits = c(0, 2.15))+
  facet_wrap(~period)+
  theme_few()+
  xlab("average growth in men's height over 10 years, cm")+
  ylab(NULL)+
  theme_few(base_size = 20, base_family = "mono")+
  theme(legend.position = "none", panel.grid = element_line(colour = "grey75", size = .25))

    Figure 7. Average changes in male population

    The height increase for ice hockey players seems quite impressive if we compare it to the stagnating dynamics in the corresponding male populations. And the acceleration of goalkeepers’ height is outright amazing.

The diverging trends in the height of ice hockey players and the general population are likely to be driven by the strengthening selection in sport.

    Selection in ice hockey

Looking through the literature on selection in sport, I saw a finding that showed a notable disproportion of professional sportsmen by month of birth. There are many more sportsmen born in the first half of the year. They have a lasting advantage since kids’ teams are usually formed by birth cohorts. Thus, those born earlier in the year have always lived a bit longer than their later-born team mates, which means that they are physically more mature. It is easy to test the finding on our ice hockey players dataset.

# check if there are more players born in earlier months
df_month <- df %>%
  mutate(month = month(birth)) %>%
  mutate(month = factor(month))

gg_month <- ggplot(df_month, aes(x = factor(month)))+
  geom_bar(stat = "count", fill = brbg11[8])+
  scale_x_discrete(breaks = 1:12, labels = month.abb)+
  labs(x = "month of birth", y = "# of players")+
  theme_few(base_size = 20, base_family = "mono")+
  theme(legend.position = "none", panel.grid = element_line(colour = "grey75", size = .25))

    Figure 8. The distribution of ice hockey players by month of birth

True, the distribution is notably skewed – there are many more players born in the earlier months. When I further split the dataset by decade of birth, it becomes clear that the effect grows stronger over time (Figure 9). Indirectly, this means that selection in ice hockey is becoming tougher.

# facet by decades
df_month_dec <- df_month %>%
  mutate(dec = substr(paste(cohort), 3, 3) %>%
           factor(labels = paste("born in", c("1960s", "1970s", "1980s", "1990s"))))

gg_month_dec <- ggplot(df_month_dec, aes(x = factor(month))) +
  geom_bar(stat = "count", fill = brbg11[8]) +
  scale_x_discrete(breaks = 1:12, labels = month.abb) +
  labs(x = "month of birth", y = "# of players") +
  facet_wrap(~dec, ncol = 2, scales = "free") +
  theme_few(base_size = 20, base_family = "mono") +
  theme(legend.position = "none",
        panel.grid = element_line(colour = "grey75", size = .25))

    Figure 9. The distribution of ice hockey players by month of birth – separately by decades of birth
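As a quick extra check (my own addition, not part of the original analysis), a chi-square goodness-of-fit test against a uniform distribution of birth months can quantify the skew. The sketch below reuses the df_month data frame from the chunk above and, for simplicity, ignores the unequal lengths of the months:

# hedged sketch: are birth months uniformly distributed?
# (assumes df_month from the chunk above; ignores different month lengths)
month_counts <- table(df_month$month)
chisq.test(month_counts)  # a tiny p-value points to a skewed distribution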

    Reproducibility

    The full R script can be downloaded here.

To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.


    R Weekly Bulletin Vol – VII

    Sat, 2017-05-06 13:26

    (This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin covers topics such as how to create a sequence of dates, how to add a number to a date-time variable, and how to convert a date in the American format to the standard format.

We will also cover the identical, all.equal, any, and all functions. Hope you like this R weekly bulletin. Enjoy reading!

    Shortcut Keys

    1. Run current line/selection – Ctrl+Enter
    2. Move cursor to beginning of line – Home key
    3. Move cursor to end of line – End key

Problem Solving Ideas

Creating a sequence of dates using the seq.Date function

The seq.Date function can be used to generate date sequences. It works on dates of class "Date"; for date-times of the POSIX classes, one can use the seq.POSIXt function in the same way (a short seq.POSIXt sketch follows the examples below). The syntax of the seq.Date function is:

seq.Date(from, to, by, length.out)

Where,
“from” is the starting date.
“to” is the end date; this argument is optional.
“by” is the increment of the sequence.
“length.out” takes an integer value; this optional argument indicates the desired length of the sequence.

    Examples:

# by month
seq(as.Date("2016/1/1"), by = "month", length.out = 12)

[1] "2016-01-01" "2016-02-01" "2016-03-01" "2016-04-01" "2016-05-01"
[6] "2016-06-01" "2016-07-01" "2016-08-01" "2016-09-01" "2016-10-01"
[11] "2016-11-01" "2016-12-01"

# by quarters
seq(as.Date("2000/1/1"), as.Date("2003/1/1"), by = "quarter")

[1] "2000-01-01" "2000-04-01" "2000-07-01" "2000-10-01" "2001-01-01"
[6] "2001-04-01" "2001-07-01" "2001-10-01" "2002-01-01" "2002-04-01"
[11] "2002-07-01" "2002-10-01" "2003-01-01"

# If we specify a value like "2 quarter" in the 'by' argument, the dates
# in the resulting sequence are spaced that far apart.
seq(as.Date("2000/1/1"), as.Date("2003/1/1"), by = "2 quarter")

[1] "2000-01-01" "2000-07-01" "2001-01-01" "2001-07-01" "2002-01-01"
[6] "2002-07-01" "2003-01-01"
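As mentioned above, seq.POSIXt plays the same role for POSIXct date-times. A minimal sketch of my own (the start time and the "hour" increment are illustrative values, not from the bulletin):

# hedged sketch: an hourly sequence of POSIXct date-times
seq(as.POSIXct("2016-01-01 09:00:00"), by = "hour", length.out = 4)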

    Adding a number to a date & time variable

There are three date & time classes in R: POSIXct, POSIXlt, and Date. When we add a number to a POSIXct or POSIXlt date-time, the number is interpreted in seconds, so the date-time is shifted forward by that many seconds.

    Example:

now_time = Sys.time()
print(now_time)

[1] "2017-05-06 21:01:19 IST"

# Let us add a value of 1 to the now_time variable.
x = now_time + 1
print(x)

[1] "2017-05-06 21:01:20 IST"

As can be seen from the output, the date-time shifts by 1 second. If we want to shift the “now_time” variable forward by 1 day, we have to add 86400 to it, since 1 day is equivalent to 86400 seconds.

now_time = Sys.time()
y = now_time + 86400
print(y)

[1] "2017-05-07 21:01:20 IST"

However, if the date is stored as the Date class, adding a value of 1 shifts it forward by 1 day. Thus, adding a number to a variable of class Date shifts it by that many days.

now_time = as.Date(Sys.time())
print(now_time)

[1] "2017-05-06"

x = now_time + 1
print(x)

[1] "2017-05-07"

    Converting American date format to standard format

The American date format is of the type mm/dd/yyyy, whereas the ISO 8601 standard format is yyyy-mm-dd. To convert an American date to the standard format, we use the as.Date function together with its format argument. The example below illustrates the method.

    Example:

# date in American format
dt = "07/24/2016"

# If we call the as.Date function on the date as-is, it throws an error,
# because the default format assumed by as.Date is yyyy-mm-dd.
as.Date(dt)

Error in charToDate(x): character string is not in a standard unambiguous format

# Correct way of converting the date
as.Date(dt, format = "%m/%d/%Y")

[1] "2016-07-24"

Functions Demystified

identical function

The identical function tests whether two objects in R are exactly equal. The objects to be compared are passed as the arguments to the function, and the function returns a logical TRUE/FALSE as the output.

    Examples:

y1 = c(1:12)
y2 = c(1:12)
identical(y1, y2)

[1] TRUE

days = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")
months = c("Jan", "Feb", "Mar", "April", "May", "June", "July")
identical(days, months)

[1] FALSE

    all.equal function

The all.equal function checks whether two objects are (nearly) equal, allowing for a small numerical tolerance. If the compared values differ by more than the tolerance, all.equal returns a report on the differences. The syntax of the function is:

    all.equal(target, current)

where,
target is an R object.
current is another R object, to be compared with target.

    Example:

all.equal(3, 2)

[1] "Mean relative difference: 0.3333333"

# Wrapping the call in isTRUE returns a logical value instead of the
# differences report.
isTRUE(all.equal(3, 2))

[1] FALSE
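To illustrate the tolerance point, here is a small sketch of my own (the numbers are arbitrary): a difference far below the default tolerance (about 1.5e-8) is treated as equal by all.equal, while exact comparisons still see a difference.

# hedged sketch: all.equal vs exact comparison for a tiny difference
all.equal(1, 1 + 1e-10)   # TRUE, the difference is within the tolerance
identical(1, 1 + 1e-10)   # FALSE, the numbers are not exactly equal
1 == 1 + 1e-10            # FALSE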

    any and all functions

The “any” function checks whether there is at least one TRUE value in a given logical vector. The “all” function returns TRUE only if all the values in the input logical vector are TRUE.

    Example 1:

sample = c(FALSE, FALSE, FALSE)
any(sample)

[1] FALSE

all(sample)

[1] FALSE

    Example 2:

sample = c(TRUE, FALSE, FALSE)
any(sample)

[1] TRUE

all(sample)

[1] FALSE

    Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more useful methods and R functions for our readers.



To leave a comment for the author, please follow the link and comment on their blog: R programming.


    MIT Step by Step Instructions for Creating Your Own R Package.

    Sat, 2017-05-06 13:11

    (This article was first published on Data R Value, and kindly contributed to R-bloggers)

We will walk through the steps to create your own R package using RStudio and provide the link to download the complete MIT guide: “Instructions for Creating Your Own R Package”, In Song Kim, Phil Martin, and Nina McMurry, February 23, 2016.

1. Start by opening a new .R file. Make sure your workspace is clear by typing rm(list = ls()). Check that it is empty using ls() (you should see character(0)).

    2. Write the code for your functions in this .R file. You can create one file with all of your functions or create separate files for each function. Save these files somewhere where you can easily find them.
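As a concrete illustration (my own sketch, not from the MIT guide), such a .R file could contain something as simple as the following; the function name add_one is hypothetical:

# add_one.R -- a hypothetical function saved for the package
add_one <- function(x) {
  # adds 1 to each element of a numeric vector
  x + 1
}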



    3. Install the ‘devtools’ package (install.packages(‘devtools’)).

    4. Open a new project in RStudio. Go to the ‘File’ menu and click on ‘New Project.’ Then select ‘New Directory,’ and ‘R Package’ to create a new R package.



    5. Type the name of your package, then upload the .R file you created in step 1 under ‘Create package based on source files’. Click ‘Create project.’



    6. On the lower right hand side of your screen, you should see a file directory. The ‘R’ folder contains the code for your functions. The ‘man’ folder will contain the help files for each function in your package. Depending on your version of RStudio, the help files may have been generated automatically as .Rd or “R documentation” files when you created your package.
    If the ‘man’ folder already contains .Rd files, open each file, add a title under the ‘title’ heading, and save (if not, see step 7). You can go back and edit the content later, but you will need to add a title to each .Rd file in order to compile your package.



    7. If your ‘man’ file is empty, you will have to manually create a .Rd file for each function. To do this, go to File > New File > R Documentation, enter the title of the function and select ‘Function’ under the ‘Rd template’ menu. Edit your new file to include something in the ‘title’ field (again, you may make other edits now or go back and make edits later, but your package will not compile if the ‘title’ field is empty). Save each .Rd file in the ‘man’ folder.
    NOTE: You will need to complete this step if you add more functions to your package at a later point, even if RStudio automatically generated R documentation files when you initially created the package.
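For orientation, a minimal .Rd file might look like the sketch below (written for the hypothetical add_one function from step 2; the field contents are illustrative, not from the guide):

\name{add_one}
\alias{add_one}
\title{Add One to a Numeric Vector}
\description{Adds 1 to each element of a numeric vector.}
\usage{add_one(x)}
\arguments{
  \item{x}{a numeric vector.}
}
\examples{
add_one(1:3)
}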



    8. Now you are ready to compile your package. Go to ‘Build’ on the top toolbar and select ‘Build and Reload’ (note you can also use the keyboard shortcut Ctrl+Shift+B). If this works, your package will automatically load and you will see library(mynewpackage) at the bottom of your console. Test your functions to make sure they work.
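For instance, assuming the hypothetical add_one() function sketched in step 2 and a package named mynewpackage (as in the note above), a quick console check might look like this:

# the package is attached automatically after 'Build and Reload';
# otherwise attach it by hand and call a function from it
library(mynewpackage)   # hypothetical package name
add_one(1:3)            # should return 2 3 4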



9. Go back and edit the documentation (the help file) for each function. Open each .Rd file, add a brief description of the function, define its arguments and, if applicable, its return value, and include at least one example. Then, re-compile your package and test out your documentation in the R console (?myfun). NOTE: You will need to re-compile (repeating step 8) each time you make changes to your functions or documentation.



    10. Once you have finished creating your functions and documentation, compiled your package, and double checked that the functions and help files work, copy the entire folder containing your package to the Dropbox folder with your name on it.


    Download the complete guide at:
    https://github.com/pakinja/Data-R-Value

To leave a comment for the author, please follow the link and comment on their blog: Data R Value.
