# R bloggers

### ODSC West 2016 – 20% off discount code for training with leading R experts

*(Guest post by the ODSC West Team)*

These are good days to be an R programmer. There has been a slew of big players announcing R integration, including IBM and Microsoft, and these and similar announcements are accelerating R’s move out of the lab and into the enterprise. Couple this development with the fact that demand for data scientists is still white hot, especially for those with programming skills in languages like R. The Open Data Science Conference (ODSC) in Santa Clara on November 4th–6th is there to help you accelerate your R learning by covering a lot of ground on a lot of topics quickly. This year’s ODSC conference includes more than 90 talks and 50 workshops on a wide range of data science tools, topics and languages, including R. Here are some great reasons why thousands of R enthusiasts, just like you, will be attending ODSC West 2016.

ODSC conferences are well known for the caliber of their speakers. Presenters are hand-picked, and only the best and brightest in the data science field make the cut. That’s why we invited two of the top R experts to train you this year.

Jared Lander, *author of R for Everyone*, will be giving two premium training sessions:

- Modeling & Analytics in R
- Machine Learning Tools in R

Also, don’t miss your chance to learn and train with one of the best-known names in R!

- Joseph Rickert, RStudio R Ambassador

If you are a beginner, a software engineer, or just interested in data science, we have a fantastic four-hour training session just for you. This workshop is specifically designed to help beginners build the foundation needed to become a data scientist.

Want more intro sessions? Check out this ODSC workshop.

Have experience in the field already? Check out these advanced workshops in R. This is a great opportunity to train and network with highly-experienced data science practitioners.

- Prasad Saripalli: ML Use Cases Workshop Session with R
- Stuart Bailey: AnalyticOps – How to Develop Mission Critical Analytic Deployment Capabilities
- John Mount & Nina Zumel: A Unified View of Model Evaluation

In addition to all of the learning you’ll be doing, the connections you make at this conference will help you solve your next problem or find your next project. You could even land your first (or next) job at the ODSC Career Fair, featuring 50 leading tech companies filling over 400 positions.

We’re now expecting more than 3,000 data scientists, software engineers, ML/DL experts, CTOs, and others for three full days of training, learning and networking with some of the top names in data science at ODSC in Santa Clara. We hope to see you there too.

The conference is just weeks away and tickets are going fast. **Use discount code ODSC-RBLOG** for an extra 20% off our already low prices.

Redeem discount here

See you in Santa Clara,

ODSC West Team

### Statistical Reading Rainbow

(This article was first published on ** R – Mathew Analytics**, and kindly contributed to R-bloggers)

For those of us who received statistical training outside of statistics departments, that training often emphasized procedures over principles: we learned about various statistical techniques and how to perform analyses in particular statistical software, but glossed over the mechanisms and mathematical statistics underlying these practices. While that training methodology (hereby referred to as the ‘heuristic method’) has value, it has many drawbacks when the ultimate goal is to perform statistical analysis that is sound, valid, and thorough. Even in my current role as a data scientist at a technology company in the San Francisco Bay Area, I have had to go back and understand various procedures and metrics instead of just “doing data analysis”.

Given this realization, I have dedicated hours of time outside of work over the last couple of years to “re-training” myself on many of the important concepts in both descriptive and inferential statistics. This post will give brief mention to the books that have been most useful in helping me develop a fuller understanding of the statistical sciences. These books have also helped me fill in the gaps and deficiencies from my statistical training in university and graduate school. Furthermore, these are the texts that I often revisit when I need a reference on some statistical topic of interest. This is at a minimum a six-year journey, so I have a long way to go until I am able to stand solidly in my understanding of statistics. While I am sacrificing a lot of my free time to this undertaking, it will certainly improve my knowledge and help prepare me for graduate school (PhD) in biostatistics, which I hope to attend in around five years.[1]

Please note that I am not taking issue with the ‘heuristic method’ of statistical training. It certainly has its place and provides students with the immediate knowledge required to satisfactorily prepare for work in private industry. In fact, I prefer the ‘heuristic method’ and still rely on straightforward rules in my day-to-day work, as that ensures that best practices are followed and satisfactory analysis is performed. Furthermore, I certainly believe that it is superior to the hacky nature of data mining and data science education, but that is a different story.

Fundamentals

Statistics in Plain English – Urdan

Clear, concise, and covers all the fundamental items that one would need to know. Everything from descriptive statistics to linear regression is covered, with many good examples. Even if you never use ANOVA or factor analysis, this is a good book to review and one that I strongly recommend to people who are interested in data science.

Principles of Statistics – Bulmer

This is a classic text that offers a good treatment of probability theory, distributions, and statistical inference. The text contains a bit more math than ‘Statistics in Plain English’, so I think it should be read after completing the previous book.

Fundamentals of Modern Statistical Methods – Wilcox

This book reviews ‘traditional’ parametric statistics and provides a good overview of robust statistical methods. There is a fair amount on the historical evolution of various techniques, and I found that a bit unnecessary. But overall, this is still a solid introductory text to learn about statistical inference using robust techniques.

Mostly Harmless Econometrics – Angrist and Pischke

While I don’t regularly work with instrumental variables, generalized methods of moments, or regression discontinuity, this book is a great high level introduction to econometrics. The chapters on regression theory and quantile regression are phenomenal.

Regression Modeling Strategies – Harrell

This is my most referenced book and the one that really helped in my overall development as an applied statistician. All the important topics are covered, from imputation to regression splines and so forth. The book includes R code for performing analysis with the rms package. I end up citing it quite a lot; for example, in a recent work email, I mentioned that Harrell says on page 61 that “narrowly distributed predictor variables will require higher sample sizes.” Essential reading in my opinion.

Data Analysis Using Regression and Multilevel/Hierarchical Models – Gelman and Hill

The first half of this book covers statistical inference using single-level models and the second half is dedicated to multilevel methods. Given that I rarely work with panel data, I use the first half of this book as a reference for things I may need a quick refresher on. It is very accessible and has plenty of examples with R code.

Semiparametric Regression for the Social Sciences – Keele

This is one of my favorite statistical books. Well written and easy to comprehend, but still rigorous. Covers local regression, splines, and generalized additive models. There is also a solid chapter on the use of bootstrapping with semiparametric and nonparametric models.

Statistical Learning from a Regression Perspective – Berk

As a skeptic who is wary of every hype machine, I really enjoyed Berk’s preface, in which he discusses the “dizzying array of new statistical procedures” that have been introduced over the past several decades with “the hype of a big-budget movie.” I got this text for its treatment of topics such as boosting, bagging, random forests, and support vector machines. I will probably need to reread this book several more times before I fully comprehend everything.

Time Series: a Biostatistical Introduction – Diggle

The lack of quality time series books is really infuriating. Don’t get me wrong, there are some good texts on forecasting, such as the free online book from Hyndman. However, I’ve yet to find a really good intermediate-level treatment of time series analysis besides this one. It contains good coverage of repeated measurements, ARIMA modeling, and forecasting.

Statistical Rethinking – McElreath

While I was introduced to robust techniques and nonparametric statistics in graduate school, there was nothing on Bayesian methods. Due to a fear of the topic, I avoided learning about it until this past year. This book by McElreath has been great as it is very accessible and provides code for understanding various principles. Over the next year, I am hoping to dive deeper into Bayesian techniques and this was a good first step.

[1] If you are an academic or doctoral student in a statistical field and are looking for a part-time research assistant, please contact me at mathewanalytics@gmail.com. I’m looking to gain as much research experience as possible before entering a doctoral program.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Mathew Analytics**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### tractable Bayesian variable selection: beyond normality

(This article was first published on ** R – Xi'an's Og**, and kindly contributed to R-bloggers)

**D**avid Rossell and Francisco Rubio (both from Warwick) arXived a month ago a paper on non-normal variable selection. They use two-piece error models that preserve manageable inference and allow for simple computational algorithms, but also characterise the behaviour of the resulting variable selection process under model misspecification. Interestingly, they show that the existence of asymmetries or heavy tails leads to power losses when using the Normal model. The two-piece error distribution is made of two halves of location-scale transforms of the same reference density on the two sides of the common location parameter. In this paper, the density is either Gaussian or Laplace (i.e., exponential?). In both cases the (log-)likelihood has a nice compact expression (although it does not allow for a useful sufficient statistic). One is the L¹ version, the other the L² version, which is, I presume, the main reason for using this formalism based on only two families of parametric distributions. (As mentioned in an earlier post, I do not consider those distributions as mixtures, because the component of a given observation can always be identified, and because, as shown in the current paper, maximum likelihood estimates can be easily derived.) The prior construction follows the non-local prior principles of Johnson and Rossell (2010, 2012) also discussed in earlier posts. The construction is very detailed and hence highlights how many calibration steps are needed in the process.
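For intuition, here is a small sketch (my own illustration, not code from the paper) of such a two-piece density, built from two location-scale transforms of a single reference density glued at the common location parameter:

```r
# Two-piece density: two halves of location-scale transforms of the same
# reference density f0, joined at the common location parameter mu.
# The normalizing constant 2/(sigma1 + sigma2) makes it integrate to one.
dtwopiece <- function(x, mu = 0, sigma1 = 1, sigma2 = 1, f0 = dnorm) {
  scale <- ifelse(x < mu, sigma1, sigma2)
  2 / (sigma1 + sigma2) * f0((x - mu) / scale)
}

# Equal scales recover the symmetric reference density; unequal scales
# produce the asymmetry discussed above.
total <- integrate(dtwopiece, -Inf, Inf,
                   mu = 1, sigma1 = 0.5, sigma2 = 2)$value  # integrates to 1
```

With a Gaussian reference this gives the two-piece normal, and a Laplace reference gives the two-piece Laplace.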

“Bayes factor rates are the same as when the correct model is assumed [but] model misspecification often causes a decrease in the power to detect truly active variables.”

When there are too many models to compare at once, the authors propose a random walk on the finite set of models (which does not require advanced measure-theoretic tools like reversible jump MCMC). One interesting aspect is that moving away from the normal to another member of this small family is driven by the density of the data under the marginal densities, which means moving only to interesting alternatives, while sticking with the normal for datasets it fits adequately. In a sense this is not extremely surprising, given that the marginal likelihoods are available model-wise. It is also interesting that on real datasets, one of the four models is heavily favoured against the others, be it Normal (6.3) or Laplace (6.4), and that the four-model framework returns almost identical values when compared with a single (most likely) model, although this is not immensely surprising when acknowledging that the frequency of the most likely model is 0.998 in both cases.
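The model-space random walk can be caricatured in a few lines. This is only the generic Metropolis-on-a-finite-set idea, not the authors' exact scheme, and the marginal likelihood values below are hypothetical:

```r
# Metropolis random walk over a finite set of models: propose a different
# model uniformly at random and accept with probability min(1, m_j / m_i),
# where m_k is the (hypothetical) marginal likelihood of model k.
set.seed(42)
marglik <- c(normal = 0.998, laplace = 0.001,
             asym_normal = 0.0005, asym_laplace = 0.0005)
n_iter  <- 20000
current <- 1
visits  <- integer(length(marglik))
for (t in seq_len(n_iter)) {
  proposal <- sample(seq_along(marglik)[-current], 1)  # uniform over others
  if (runif(1) < marglik[proposal] / marglik[current]) current <- proposal
  visits[current] <- visits[current] + 1
}
freq <- visits / n_iter  # concentrates on the dominant model
```

Because the proposal is symmetric, the chain's stationary distribution is proportional to the marginal likelihoods, so it sticks with the dominant (here, normal) model and only occasionally visits the alternatives.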

“Our framework represents a middle-ground to add flexibility in a parsimonious manner that remains analytically and computationally tractable, facilitating applications where either p is large or n is too moderate to fit more flexible models accurately.”

Overall, I find the experiment quite conclusive and do not object [much] to this choice of parametric family, in that it is always more general and generic than the sempiternal Gaussian model that we picked in our Bayesian Essentials, following tradition. In a sense, it would be natural to pick the most general parametric family that still allows for fast computations, if this notion makes any sense…

Filed under: R, Statistics, University life Tagged: Bayesian Essentials with R, calibration, marginal density, maximum likelihood estimation, parametric family, R, two-piece error model, University of Warwick

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Xi'an's Og**.

### Rcpp now used by 800 CRAN packages

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

A moment ago, Rcpp hit another milestone: 800 packages on CRAN now depend on it (as measured by Depends, Imports and LinkingTo declarations). The graph on the left depicts the growth of Rcpp usage over time.

The easiest way to compute this is to use the reverse_dependencies_with_maintainers() function from a helper scripts file on CRAN. This still yields a few *false positives* from packages declaring a dependency but not actually containing C++ code, and the like. There is also a helper function revdep() in the devtools package, but it includes Suggests:, which does not firmly imply usage and hence inflates the count. I have always opted for a tighter count, with corrections.
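The counting logic can be sketched against a mock packages database (the field names mirror the output of available.packages(); the packages and versions below are made up):

```r
# Count reverse dependencies of a target package by scanning only the
# Depends/Imports/LinkingTo fields, i.e. the tighter count that excludes
# Suggests. Mock data standing in for available.packages().
db <- data.frame(
  Package   = c("pkgA", "pkgB", "pkgC", "pkgD"),
  Depends   = c("R (>= 3.0.0), Rcpp", NA, NA, NA),
  Imports   = c(NA, "Rcpp (>= 0.12.0), methods", NA, "dplyr"),
  LinkingTo = c("Rcpp", "Rcpp", NA, NA),
  Suggests  = c(NA, NA, "Rcpp", NA),
  stringsAsFactors = FALSE
)

uses_pkg <- function(db, target, fields = c("Depends", "Imports", "LinkingTo")) {
  # \\b word boundaries avoid matching e.g. RcppArmadillo when counting Rcpp
  hit <- Reduce(`|`, lapply(db[fields], function(col)
    !is.na(col) & grepl(paste0("\\b", target, "\\b"), col)))
  db$Package[hit]
}

users <- uses_pkg(db, "Rcpp")  # pkgC only Suggests Rcpp, so it is excluded
```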

Rcpp cleared 300 packages in November 2014. It passed 400 packages in June of last year (when I only tweeted about it), 500 packages less than a year ago in late October, 600 packages this March and 700 packages this July. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.

Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four per-cent hurdle was cleared just before useR! 2014 where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent July of last year, seven percent just before Christmas and eight percent this summer.

Eight hundred user packages is a staggeringly large and humbling number. It puts no small responsibility on us in the Rcpp team as we continue to keep Rcpp as performant *and* reliable as it has been.

At the rate we are going, the big 1000 may be hit before we all meet again for useR! 2017.

And with that a very big **Thank You!** to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Thinking inside the box **.

### The RStudio Founder, Debate Language, and Who Is the Most Active Data Scientist?

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

To stay on top of R in the news, we’re sharing some stories related to R published last week.

A great interview with JJ Allaire, creator of RStudio (Joseph Rickert). The man who built RStudio shares some insight into the company and his own motivation. Or was it a company? We are still not sure, JJ!

The language of the second presidential debate (Edward Lee). I do love some text analysis, but Trump is all about that money… When will data scientists start coaching presidential candidates to look good in post-debate word clouds?

The most active data scientists online (Manish Saraswat). Hadley Wickham standing proud! But it does look like the R community could take steps to improve its online visibility.

Alright, that was it for this week. We will keep uploading exercise sets this week and come back with another update as scheduled. To stay on top of the news during the week, visit our R News section.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.

### tint 0.0.3: Tint Is Not Tufte

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

The tint package, whose name stands for *Tint Is Not Tufte*, offers on CRAN a fresh take on the excellent Tufte style for html and pdf presentations.

It marks a milestone for me: I finally have a repository with more "stars" than commits. Gotta keep an eye on the big prize…

Kidding aside, and as a little teaser, here is what the pdf variant looks like:

This release corrects one minor misfeature in the pdf variant. It also adds some spit and polish throughout, including a new NEWS.Rd file. We quote from it the entries for the current as well as previous releases:

Changes in tint version 0.0.3 (2016-10-15)

- Correct pdf mode to not use italics in the table of contents (#9 fixing #8); also added color support for links etc.

Changes in tint version 0.0.2 (2016-10-06)

- Added (basic) Travis CI support (#10)

Changes in tint version 0.0.1 (2016-09-24)

- In html mode, correct use of italics and bold

- Html function renamed to tintHtml; support for Roboto fonts with (string) formats and locales, which allows for adding formats (PRs #6 and #7)

- Added pdf mode with new function tintPdf(); added appropriate resources (PR #5)

- Updated resource files

- Initial (non-CRAN) release to ghrr drat

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Thinking inside the box **.

### R code to accompany Real-World Machine Learning (Chapter 3)

(This article was first published on ** data prone - R**, and kindly contributed to R-bloggers)

The rwml-R Github repo is updated with R code to accompany Chapter 3 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf.

Survivors on the Titanic

The Titanic Passengers dataset is used to illustrate various processes used to prepare data for modeling, including conversion of factor variables to dummy variables. For example, the code to produce the following table of processed data is provided:
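In base R, the factor-to-dummy conversion described here can be sketched with model.matrix (this uses a toy data frame for illustration, not the repo's actual code):

```r
# Convert factor columns into 0/1 dummy columns, as is done when preparing
# the Titanic data; illustrated with a small made-up passenger table.
passengers <- data.frame(
  Survived = c(0, 1, 1, 0),
  Sex      = factor(c("male", "female", "female", "male")),
  Pclass   = factor(c(3, 1, 2, 3))
)

# model.matrix expands factors into dummy columns; the "- 1" keeps all
# levels of the first factor instead of dropping a baseline level.
dummies <- model.matrix(~ Sex + Pclass - 1, data = passengers)
```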

I also go “off-script” a bit (doing some things not contained in the book) and demonstrate some useful visualization, modeling, and performance-measurement techniques available with the caret and AppliedPredictiveModeling packages.

A k-nearest neighbors classifier (from the kknn package) is used to predict the numbers represented in the MNIST database of handwritten digits. Examples of the types of digits present in the dataset and the R code to display them:
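Since the MNIST files aren't bundled with R, here is a comparable k-nearest-neighbors sketch using the recommended class package (rather than kknn, which the repo uses) on the built-in iris data:

```r
library(class)  # ships with R as a recommended package

# Split iris into 100 training rows and 50 test rows
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# k-nearest neighbors: classify each test row by majority vote of its
# 5 nearest training rows in feature space
pred <- class::knn(train, test, cl = iris$Species[idx], k = 5)
accuracy <- mean(pred == iris$Species[-idx])
```

kknn exposes a formula interface on top of the same idea, with distance weighting as an extra option.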

Auto MPG dataset

As an example of a linear regression analysis, the Auto MPG dataset introduced in Chapter 2 resurfaces, and fuel economy is predicted from origin, year of production, and performance characteristics such as horsepower and engine displacement.
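A regression of that shape can be sketched in base R, with the bundled mtcars data standing in for the Auto MPG dataset (the variable names differ from the book's data):

```r
# Predict fuel economy (mpg) from performance characteristics such as
# horsepower (hp), displacement (disp) and weight (wt), mirroring the
# Auto MPG analysis on a dataset that ships with R.
fit <- lm(mpg ~ hp + disp + wt, data = mtcars)

r2    <- summary(fit)$r.squared          # proportion of variance explained
preds <- predict(fit, newdata = mtcars)  # fitted fuel-economy values
```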

As always, I’d love to hear from you if you find the project helpful or if you have any suggestions. Please leave a comment below or use the Tweet button. Also, feel free to fork the rwml-R repo and submit a pull request if you wish to contribute.

To **leave a comment** for the author, please follow the link and comment on their blog: ** data prone - R**.

### Visualizing ROC Curves in R using Plotly

(This article was first published on ** R – Modern Data**, and kindly contributed to R-bloggers)

In this post we’ll create some simple functions to generate and chart a Receiver Operating Characteristic (ROC) curve and visualize it using Plotly. See Carson’s plotly book for more details on changes in syntax.

We’ll do this from a credit risk perspective, i.e., validating a bank’s internal rating model (we’ll create a sample dataset with this in mind).

We’ll replicate computations highlighted in this paper.

```r
library(plotly)
library(dplyr)
library(flux)
```

Sample data

```r
set.seed(123)

n <- 100000
lowest.rating <- 10

# Sample internal ratings
# Say we have a rating scale of 1 to 10
ratings <- sample(1:lowest.rating, size = n, replace = T)

# Defaults
# We'll randomly assign defaults, concentrating more defaults
# in the lower rating ranges. We'll do this by creating exponentially
# increasing PDs across the rating range
power <- 5
PD <- log(1:lowest.rating)
PD <- PD ^ power
#PD <- exp((1:lowest.rating))
PD <- PD/(max(PD) * 1.2) # increased denominator to make the PDs more realistic

# Now, given the PD for each rating category, sample from a binomial
# distribution to assign actual defaults
defaults <- rep(0, n)
k <- 1
for(i in ratings){
  defaults[k] <- rbinom(1, 1, PD[i])
  k <- k + 1
}

dataset <- data.frame(Rating = ratings, Default = defaults)

# Check if the dataset looks realistic
# df <- dataset %>%
#   group_by(Rating) %>%
#   summarize(Def = sum(Default == 1), nDef = sum(Default == 0))
```

ROC Curve Computation

Now that we have a sample dataset to work with, we can start to create the ROC curve.

```r
ROCFunc <- function(cutoff, df){
  # Function counts the number of defaults happening in all the rating
  # buckets less than or equal to the cutoff
  # Number of hits = number of defaults with rating < cutoff / total defaults
  # Number of false alarms = number of non defaults with rating < cutoff /
  #   total non defaults
  nDefault <- sum(df$Default == 1)
  notDefault <- sum(df$Default == 0)

  temp <- df %>% filter(Rating >= cutoff)
  hits <- sum(temp$Default == 1)/nDefault
  falsealarm <- sum(temp$Default == 0)/notDefault

  ret <- matrix(c(hits, falsealarm), nrow = 1)
  colnames(ret) <- c("Hits", "Falsealarm")
  return(ret)
}

# Arrange ratings in decreasing order
# A lower rating is better than a higher rating
vec <- sort(unique(ratings), decreasing = T)

ROC.df <- data.frame()
for(i in vec){
  ROC.df <- rbind(ROC.df, ROCFunc(i, dataset))
}

# Last row to complete polygon
labels <- data.frame(x = ROC.df$Falsealarm, y = ROC.df$Hits, text = vec)
ROC.df <- rbind(c(0,0), ROC.df)

# Area under curve
AUC <- round(auc(ROC.df$Falsealarm, ROC.df$Hits), 3)
```

Plot

```r
plot_ly(ROC.df, y = ~Hits, x = ~Falsealarm, hoverinfo = "none") %>%
  add_lines(name = "Model",
            line = list(shape = "spline", color = "#737373", width = 7),
            fill = "tozeroy", fillcolor = "#2A3356") %>%
  add_annotations(y = labels$y, x = labels$x, text = labels$text,
                  ax = 20, ay = 20, arrowcolor = "white", arrowhead = 3,
                  font = list(color = "white")) %>%
  add_segments(x = 0, y = 0, xend = 1, yend = 1,
               line = list(dash = "7px", color = "#F35B25", width = 4),
               name = "Random") %>%
  add_segments(x = 0, y = 0, xend = 0, yend = 1,
               line = list(dash = "10px", color = "black", width = 4),
               showlegend = F) %>%
  add_segments(x = 0, y = 1, xend = 1, yend = 1,
               line = list(dash = "10px", color = "black", width = 4),
               showlegend = F) %>%
  add_annotations(x = 0.8, y = 0.2, showarrow = F,
                  text = paste0("Area Under Curve: ", AUC),
                  font = list(family = "serif", size = 18, color = "#E8E2E2")) %>%
  add_annotations(x = 0, y = 1, showarrow = F, xanchor = "left",
                  xref = "paper", yref = "paper",
                  text = paste0("Receiver Operator Curve"),
                  font = list(family = "arial", size = 30, color = "#595959")) %>%
  add_annotations(x = 0, y = 0.95, showarrow = F, xanchor = "left",
                  xref = "paper", yref = "paper",
                  text = paste0("Charts the percentage of correctly identified ",
                                "defaults (hits) against the percentage of non ",
                                "defaults incorrectly identified as defaults ",
                                "(false alarms)"),
                  font = list(family = "serif", size = 14, color = "#999999")) %>%
  layout(xaxis = list(range = c(0,1), zeroline = F, showgrid = F,
                      title = "Number of False Alarms"),
         yaxis = list(range = c(0,1), zeroline = F, showgrid = F,
                      domain = c(0, 0.9), title = "Number of Hits"),
         plot_bgcolor = "#E8E2E2", height = 800, width = 1024)
```
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Modern Data**.

### The Grammar of Graphics and Radar Charts

(This article was first published on ** R-Chart**, and kindly contributed to R-bloggers)

A radar chart is often described as a two-dimensional chart that displays multivariate data on axes radiating from a common central point. Although this is an accurate description, it does not express the design and structure of the chart in a way that relates it to other types of charts. The grammar of graphics embodied by ggplot2 provides not only a way of representing such a chart, but also a syntax that can help one compare it to other types of charts. A radar chart might be described using ggplot2 terminology as being

*a line chart with a completed path plotted using a polar rather than a cartesian coordinate system.*

The components of this definition map to ggplot functions:

- The phrase “a line chart” suggests the geom_line() function. We can do our first exploratory plots using a line chart and modify the results incrementally.
- The phrase “with a completed path” indicates that a line is not sufficient. We need to ensure the beginning and end meet. The geom_polygon() function can be used to generate the outline of a polygon to accomplish this.
- The final requirement is that the chart be “plotted using a polar rather than a cartesian coordinate system”. The coord_polar() function accomplishes this transformation.

Throughout this post, the equals assignment operator will be used instead of “less-than-dash” because blogger mangles this character sequence. I prefer the latter stylistically. The code in this post is available in a script at GitHub as well. Start by importing the following packages.**library(dplyr)**
**library(ggplot2)**
**library(scales)**
**library(reshape2)**
**library(tibble)**

dplyr and reshape2 are used to structure and filter the dataset. The scales package is used to normalize data values for convenient comparison. The tibble package provides convenient viewing and utility functions when working through dplyr pipelines, and ggplot2 is used to create the charts.

Create a data frame based on the mtcars dataset. The rownames_to_column function from the tibble package is used to create a new column named "car" from the rownames. The rescale function from the scales package transforms all numeric variables in the dataset so that they have comparable values (between 0 and 1). The melt function from the reshape2 package creates variable and value columns and populates them with the names and values that previously were independent columns. The data set is now "longer" and not tidy, but the format is useful for the plots created later. The data set is then ordered by the name of the car.
**df = mtcars %>%**
** rownames_to_column(var = "car") %>%**
** mutate_each(funs(rescale), -car) %>%**
** melt(id.vars = c("car"), measure.vars = colnames(mtcars)) %>%**
** arrange(car)**

A radar chart is really just a line plot altered to be charted in an alternative coordinate system.
**line_plot = df %>%**
** filter(variable == "mpg") %>%**
** ggplot(aes(x = car, y = value, group = 1)) +**
** geom_line(color = "purple")**
**print(line_plot)**

Compare the line plot above with the same plot modified to use polar coordinates.
**print(line_plot + coord_polar())**

The result is not exactly a radar chart: there is a gap where the beginning and the end of the line are not connected. We need to connect the line so that there is a completed path with no beginning or endpoint. This can be accomplished by using geom_polygon instead of the geom_line function. Since we only want an outline (not a filled polygon), we specify fill=NA as an argument. This looks a bit funny when plotted using a standard cartesian coordinate system.
**polygon_plot = df %>%**
** filter(variable == "mpg") %>%**
** ggplot(aes(x = car, y = value, group = 1)) +**
** geom_polygon(color = "purple", fill = NA)**
**print(polygon_plot)**
The result rendered with polar coordinates now includes a completed path.

**print(polygon_plot + coord_polar())**

A bit of cleanup of the themes and label orientation can improve the presentation further.
**print(polygon_plot + coord_polar() + **
** theme_bw() + **
** theme(axis.text.x = **
** element_text(**
** vjust=50,**
** angle=-90 – 360 / length(unique(df$car)) * seq_along(df$car)**
** )**
** )**
**)**

One of the values of radar charts is that the result has a distinctive “shape” that can help to highlight certain patterns or similarities between results. Multifaceted charts can be used to quickly render a set of radar charts. This example plots all car data variables and facets by variable type.
```r
df %>%
  ggplot(aes(x = car, y = value, group = variable, color = variable)) +
  geom_polygon(fill = NA) +
  coord_polar() + theme_bw() + facet_wrap(~ variable) +
  # scale_x_discrete(labels = abbreviate) +
  theme(axis.text.x = element_text(size = 3))
```

This final example facets by car rather than variable type. Cars with related characteristics share a similar shape.
```r
df %>%
  ggplot(aes(x = variable, y = value, group = car, color = car)) +
  geom_polygon(fill = NA) +
  coord_polar() + theme_bw() + facet_wrap(~ car)
```

Ggplot2 not only renders good-looking charts… it enables you to reason about them based upon a well-thought-out API. Although radar plots are not currently included by default in ggplot2, if you reason about their structure a bit, you can “build up” a chart from the functions available. Pie charts are also not included in ggplot2 but can be constructed using a stacked bar chart and an appropriate coordinate system. Ggplot2 provides a playground where you can train yourself to better understand the components of charts and how various chart types relate. Working with the package in this way will help you see that the “gg” in ggplot2 is as significant as the plotting itself.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-Chart**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Progress bar overhead comparisons

(This article was first published on ** Peter Solymos - R related posts**, and kindly contributed to R-bloggers)

As a testament to my obsession with progress bars in R, here is a quick investigation into the overhead cost of drawing a progress bar during computations in R. I compared several approaches, including my **pbapply** package and Hadley Wickham’s **plyr**. Let’s compare the good old lapply function from base R, a custom-made variant called lapply_pb that was proposed here, l_ply from the **plyr** package, and finally pblapply from the **pbapply** package:
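The comparison code itself is not included in this excerpt; here is a minimal base R sketch of the idea. The lapply_pb function below is a hypothetical stand-in for the txtProgressBar-based variant the post refers to; l_ply (with .progress = "text") and pblapply would slot into f the same way.

```r
# Hypothetical sketch: compare plain lapply against a txtProgressBar wrapper.
# The expected run time is n * s, so elapsed minus expected is the overhead.
lapply_pb <- function(X, FUN, ...) {
  pb <- txtProgressBar(min = 0, max = length(X), style = 3)
  on.exit(close(pb))
  res <- vector("list", length(X))
  for (i in seq_along(X)) {
    res[[i]] <- FUN(X[[i]], ...)
    setTxtProgressBar(pb, i)
  }
  res
}

f <- function(n, s = 0.001) {
  g <- function(i) Sys.sleep(s)
  expected <- n * s
  c(lapply    = unname(system.time(lapply(seq_len(n), g))["elapsed"]) - expected,
    lapply_pb = unname(system.time(lapply_pb(seq_len(n), g))["elapsed"]) - expected)
}

f(100)  # per-variant overhead in seconds
```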

Use the function f to run all four variants. The expected run time is n * s (number of iterations x sleep duration), so we can calculate the overhead from the return objects as elapsed minus expected. Let’s get some numbers for a variety of n values, replicated B times to smooth out the variation:

The plot tells us that the overhead increases linearly with the number of iterations when using lapply without a progress bar. The other three approaches show similar patterns to each other, and their overhead is constant: the lines are parallel above 100 iterations after an initial increase. The per-iteration overhead decreases, approaching the lapply line. Note that all the differences are tiny, with no practical consequence for choosing one approach over another in terms of processing time. This is good news and another argument for using a progress bar, because its usefulness far outweighs the minimal overhead cost (<2 seconds here for 1000 iterations).

As always, suggestions and feature requests are welcome. Leave a comment or visit the GitHub repo.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Peter Solymos - R related posts**.

### anytime 0.0.3: Extension and fixes

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

anytime arrived on CRAN with releases 0.0.1 and 0.0.2 about a month ago. anytime aims to convert *anything* in integer, numeric, character, factor, ordered, … format to POSIXct (or Date) objects.
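As a quick illustration of that promise, here is a sketch assuming the anytime package is installed, using its exported anytime() and anydate() functions:

```r
library(anytime)

anytime("2016-10-13 10:11:12")   # character  -> POSIXct
anytime(20161013L)               # integer yyyymmdd -> POSIXct
anydate("2016-10-13")            # character  -> Date
```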

Release 0.0.3 brings a bugfix for Windows (where, for dates before the epoch of 1970-01-01, accessing the tm_isdst field for daylight saving time would *crash* the session) and a small (unexported) extension to test format strings. This last feature plays well with the ability to add format strings, which we added in 0.0.2.

The NEWS file summarises the release:

Changes in anytime version 0.0.3 (2016-10-13)

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.

For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Thinking inside the box **.

### On the ifelse function

(This article was first published on ** Florian Privé**, and kindly contributed to R-bloggers)

In this post, I will talk about the **ifelse** function, whose behaviour can easily be misunderstood, as pointed out in my latest question on SO. I will try to show how it can be used, and misused. We will also check whether it is as fast as we might expect from a vectorized base R function.

The first example comes directly from the R documentation:

```r
x <- c(6:-4)
sqrt(x)  #- gives warning
## Warning in sqrt(x): NaNs produced
##  [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000      NaN      NaN
## [10]      NaN      NaN
sqrt(ifelse(x >= 0, x, NA))  # no warning
##  [1] 2.449490 2.236068 2.000000 1.732051 1.414214 1.000000 0.000000       NA       NA
## [10]       NA       NA
```

So, it can be used, for instance, to handle special cases in a vectorized, succinct way.

The second example comes from the vignette of Rcpp Sugar:

```r
foo <- function(x, y) {
  ifelse(x < y, x*x, -(y*y))
}
foo(1:5, 5:1)
## [1]  1  4 -9 -4 -1
```

So, it can be used to construct a vector by doing an element-wise comparison of two vectors and specifying a custom output for each comparison.

A last example, just for the pleasure:

```r
(a <- matrix(1:9, 3, 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
ifelse(a %% 2 == 0, a, 0)
##      [,1] [,2] [,3]
## [1,]    0    4    0
## [2,]    2    0    8
## [3,]    0    6    0
```

**How can it be misused?**

I think many people believe they can use ifelse as a shorter way of writing an if-then-else statement (this is a mistake I made). For example, I used:

```r
legend.pos <- ifelse(is.top,
                     ifelse(is.right, "topright", "topleft"),
                     ifelse(is.right, "bottomright", "bottomleft"))
```

instead of:

```r
if (is.top) {
  if (is.right) {
    legend.pos <- "topright"
  } else {
    legend.pos <- "topleft"
  }
} else {
  if (is.right) {
    legend.pos <- "bottomright"
  } else {
    legend.pos <- "bottomleft"
  }
}
```

That works, but this doesn’t:

```r
ifelse(FALSE, 0, 1:5)
## [1] 1
```

Indeed, if you read the R documentation carefully, you see that ifelse returns a vector of the same length and attributes as the condition (here, of length 1).
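Two more base R illustrations of this length-and-attributes rule:

```r
# The test's shape wins: a matrix test gives a matrix result,
# and a length-2 test truncates the length-10 'yes' vector.
m <- matrix(c(TRUE, FALSE, TRUE, FALSE), nrow = 2)
ifelse(m, "yes", "no")                  # still a 2 x 2 matrix
length(ifelse(c(TRUE, TRUE), 1:10, 0))  # 2, not 10
```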

If you really want to use a more succinct notation, you could use

```r
`if`(FALSE, 0, 1:5)
## [1] 1 2 3 4 5
```

If you’re not familiar with this notation, I suggest you read the chapter about functions in the book *Advanced R*.

Consider the Rcpp Sugar example again; here are four ways to compute it:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fooRcpp(const NumericVector& x, const NumericVector& y) {
  int n = x.size();
  NumericVector res(n);
  double x_, y_;
  for (int i = 0; i < n; i++) {
    x_ = x[i];
    y_ = y[i];
    if (x_ < y_) {
      res[i] = x_*x_;
    } else {
      res[i] = -(y_*y_);
    }
  }
  return res;
}
```

```r
fooRcpp(1:5, 5:1)
## [1]  1  4 -9 -4 -1
```

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fooRcppSugar(const NumericVector& x, const NumericVector& y) {
  return ifelse(x < y, x*x, -(y*y));
}
```

```r
fooRcppSugar(1:5, 5:1)
## [1]  1  4 -9 -4 -1

foo2 <- function(x, y) {
  cond <- (x < y)
  cond * x^2 - (1 - cond) * y^2
}
foo2(1:5, 5:1)
## [1]  1  4 -9 -4 -1

library(microbenchmark)
x <- rnorm(1e4)
y <- rnorm(1e4)
print(microbenchmark(
  foo(x, y),
  foo2(x, y),
  fooRcpp(x, y),
  fooRcppSugar(x, y)
))
## Unit: microseconds
##                expr     min       lq      mean  median       uq      max neval
##           foo(x, y) 510.535 542.6510 872.23474 563.510 716.9680 2439.447   100
##          foo2(x, y)  71.183  75.1560 147.17468  83.765  93.8635 1977.250   100
##       fooRcpp(x, y)  40.393  44.6970  63.59186  47.676  51.1535 1468.038   100
##  fooRcppSugar(x, y) 138.394 141.3745 179.16429 142.533 161.4045 1575.972   100
```

Even though it is a vectorized base R function, ifelse is known to be slow.

**Conclusion**

Beware when you use the ifelse function. Moreover, if you make a substantial number of calls to it, be aware that it isn’t very fast; there exist at least 3 faster alternatives to it.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Florian Privé**.

### Converting mouse to human gene names with biomaRt package

(This article was first published on ** Let's talk about science with R**, and kindly contributed to R-bloggers)

Converting mouse gene names to the human equivalent and vice versa is not always as straightforward as it seems, so I wrote a function to simplify the task. The function takes advantage of the **getLDS()** function from the **biomaRt** package to get the hgnc symbol equivalent of the mgi symbol. For example, let’s convert the following mouse gene symbols, *Hmmr*, *Tlx3*, and *Cpeb4*, to their human equivalents.
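The mouse-to-human function itself is not shown in this excerpt; here is a sketch of what it plausibly looks like, mirroring the human-to-mouse version given below (it assumes the biomaRt package and a live Ensembl connection, so the call is left commented out):

```r
# Hypothetical sketch of the mouse-to-human converter described in the post
convertMouseGeneList <- function(x){
  require("biomaRt")
  human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  # Look up the hgnc symbols linked to the given mgi symbols
  genesV2 = getLDS(attributes = c("mgi_symbol"), filters = "mgi_symbol",
                   values = x, mart = mouse,
                   attributesL = c("hgnc_symbol"), martL = human, uniqueRows = TRUE)
  humanx <- unique(genesV2[, 2])
  print(head(humanx))
  return(humanx)
}

musGenes <- c("Hmmr", "Tlx3", "Cpeb4")
# genes <- convertMouseGeneList(musGenes)   # requires network access
```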

We can just as easily write a function to go from human to mouse genes.

```r
# Basic function to convert human to mouse gene names
convertHumanGeneList <- function(x){
  require("biomaRt")
  human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")
  genesV2 = getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol",
                   values = x, mart = human,
                   attributesL = c("mgi_symbol"), martL = mouse, uniqueRows = TRUE)
  humanx <- unique(genesV2[, 2])
  # Print the first 6 genes found to the screen
  print(head(humanx))
  return(humanx)
}

genes <- convertHumanGeneList(humGenes)
```

If you have any other suggestions on how to convert mouse to human gene names in R, I would love to hear them: just email me at info@rjbioinformatics.com.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Let's talk about science with R**.

### Association Rules on WideWorldImporters and SQL Server R Services

(This article was first published on ** R – TomazTsql**, and kindly contributed to R-bloggers)

Association rules are very handy for analyzing retail data, and the WWI database has a really neat set of invoices that can be used to make a primer.

Starting with the following T-SQL query:

```sql
USE WideWorldImportersDW;
GO

;WITH PRODUCT
AS (
  SELECT [Stock Item Key]
        ,[WWI Stock Item ID]
        ,[Stock Item]
        ,LEFT([Stock Item], 8) AS L8DESC
        ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
        ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
  FROM [Dimension].[Stock Item]
)
SELECT O.[WWI Order ID]
      ,O.[Order Key]
      ,O.[Stock Item Key]
      ,P.PRODUCT_GROUP
      ,O.[Description]
FROM [Fact].[Order] AS O
JOIN PRODUCT AS P ON P.[Stock Item Key] = O.[Stock Item Key]
ORDER BY O.[WWI Order ID]
        ,O.[Order Key]
```

I have created a very simple product group that neglects the distinction between product variants and treats them as one. For example:

| Stock Item Key | WWI Stock Item ID | Stock Item |
| --- | --- | --- |
| 54 | 166 | 10 mm Anti static bubble wrap (Blue) 20m |
| 53 | 167 | 10 mm Anti static bubble wrap (Blue) 50m |

Both products are initially the same; just the product variant changes: color, size, cap, volume, etc. The product group denotes the main product, “without” the product variants. I am doing this simplification for practical reasons, because of the smaller dataset.

So the new version of the product groups (variable ProductGroup) would look like:

| Stock Item Key | WWI Stock Item ID | Stock Item | ProductGroup |
| --- | --- | --- | --- |
| 54 | 166 | 10 mm Anti | 2 |
| 53 | 167 | 10 mm Anti | 2 |

Incorporating the R code for analyzing association rules into sp_execute_external_script is what the following code does:

```sql
-- Getting Association Rules into T-SQL
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
AS (
  SELECT [Stock Item Key]
        ,[WWI Stock Item ID]
        ,[Stock Item]
        ,LEFT([Stock Item], 8) AS L8DESC
        ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
        ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
  FROM [Dimension].[Stock Item]
)
SELECT O.[WWI Order ID] AS OrderID
      -- ,O.[Order Key] AS OrderLineID
      -- ,O.[Stock Item Key] AS ProductID
      ,P.PRODUCT_GROUP AS ProductGroup
      -- ,O.[Description] AS ProductDescription
      ,LEFT([Stock Item],8) AS ProductDescription
FROM [Fact].[Order] AS O
JOIN PRODUCT AS P ON P.[Stock Item Key] = O.[Stock Item Key]
GROUP BY O.[WWI Order ID]
        ,P.PRODUCT_GROUP
        ,LEFT([Stock Item],8)
ORDER BY O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
  library(arules)
  cust.data <- InputDataSet
  cd_f <- data.frame(OrderID = as.factor(cust.data$OrderID),
                     ProductGroup = as.factor(cust.data$ProductGroup))
  cd_f2_tran <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]), "transactions")
  rules <- apriori(cd_f2_tran, parameter = list(support = 0.01, confidence = 0.1))
  OutputDataSet <- data.frame(inspect(rules))'

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @TSQL
WITH RESULT SETS ((
     lhs NVARCHAR(500)
    ,[Var.2] NVARCHAR(10)
    ,rhs NVARCHAR(500)
    ,support DECIMAL(18,3)
    ,confidence DECIMAL(18,3)
    ,lift DECIMAL(18,3)
));
```

The result retrieves the rules of association between products from the transactions, along with the support they build up and the lift they eventually give to any predictions.

By executing this R code:

```r
# chart if needed
plot(rules, method = "grouped", control = list(k = 20))
```

one can also generate a graphical view of the rules and associations between products.

And finally to retrieve information on support for each of the ProductGroup (which is my case), I would execute this R code embedded into T-SQL:

```sql
DECLARE @TSQL AS NVARCHAR(MAX)
SET @TSQL = N'WITH PRODUCT
AS (
  SELECT [Stock Item Key]
        ,[WWI Stock Item ID]
        ,[Stock Item]
        ,LEFT([Stock Item], 8) AS L8DESC
        ,ROW_NUMBER() OVER (PARTITION BY LEFT([Stock Item], 8) ORDER BY ([Stock Item])) AS RN_ID_PR
        ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
  FROM [Dimension].[Stock Item]
)
SELECT O.[WWI Order ID] AS OrderID
      -- ,O.[Order Key] AS OrderLineID
      -- ,O.[Stock Item Key] AS ProductID
      ,P.PRODUCT_GROUP AS ProductGroup
      -- ,O.[Description] AS ProductDescription
      ,LEFT([Stock Item],8) AS ProductDescription
FROM [Fact].[Order] AS O
JOIN PRODUCT AS P ON P.[Stock Item Key] = O.[Stock Item Key]
GROUP BY O.[WWI Order ID]
        ,P.PRODUCT_GROUP
        ,LEFT([Stock Item],8)
ORDER BY O.[WWI Order ID]'

DECLARE @RScript AS NVARCHAR(MAX)
SET @RScript = N'
  library(arules)
  cust.data <- InputDataSet
  cd_f <- data.frame(OrderID = as.factor(cust.data$OrderID),
                     ProductGroup = as.factor(cust.data$ProductGroup))
  cd_f2_tran <- as(split(cd_f[,"ProductGroup"], cd_f[,"OrderID"]), "transactions")
  PgroupSets <- eclat(cd_f2_tran, parameter = list(support = 0.05),
                      control = list(verbose = FALSE))
  normalizedGroups <- PgroupSets[size(items(PgroupSets)) == 1]
  eachSupport <- quality(normalizedGroups)$support
  GroupName <- unlist(LIST(items(normalizedGroups), decode = FALSE))
  OutputDataSet <- data.frame(GroupName, eachSupport);'

EXEC sys.sp_execute_external_script
     @language = N'R'
    ,@script = @RScript
    ,@input_data_1 = @TSQL
WITH RESULT SETS ((
     ProductGroup NVARCHAR(500)
    ,support DECIMAL(18,3)
));
```

This ProductGroupID can be joined with T-SQL in order to receive labels:

```sql
SELECT
     LEFT([Stock Item], 8) AS L8DESC
    ,DENSE_RANK() OVER (ORDER BY (LEFT([Stock Item], 8))) AS PRODUCT_GROUP
FROM [Dimension].[Stock Item]
GROUP BY LEFT([Stock Item], 8)
```

**Pros and cons**

The biggest pro is the ability to integrate association rules with T-SQL and to have all the R code working as it should. This lets data wranglers, data scientists and data managers work out the rules that are hidden in transactional/basket data. By working with the different types of outputs (support, confidence, lift), users see immediately what goes with what. In my case, you can tell that the amount of original data (a little over 73K transactions and a little over 200K rows) is sometimes not enough to generate meaningful rules with relevant content. If the dataset were 100x bigger, I am sure this would not be the case.

Data size falls under the cons. With a larger dataset to analyse, there would be a performance drawback in terms of memory consumption (the sp_execute_external_script procedure is not able to use the RevoScaleR package and *.xdf data files) and speed. If the RevoScaleR package had a function to support this calculation, I am confident there would only be pros to the association rules learning algorithm.

To sum up, association rules are a great and powerful algorithm for finding correlations between items, and the fact that you can use them straight from SSMS just gives me goosebumps. Currently, performance is a bit of a drawback. Comparing this approach to the Analysis Services (SSAS) association rules, there are many advantages on the R side because of its maneuverability and the ability to extract the data to T-SQL; but keep in mind that SSAS is still a very awesome and powerful tool for statistical analysis and data predictions.

Code is available at Github.

Happy R-TSQLing!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – TomazTsql**.

### Shiny Server (Pro) 1.4.7

(This article was first published on ** RStudio Blog**, and kindly contributed to R-bloggers)

Shiny Server 1.4.7.815 and Shiny Server Pro 1.4.7.736 are now available! This release includes new features to support Shiny 0.14. It also updates our Node.js to 0.10.47, which includes important security fixes for SSL/TLS.

**Connection robustness (a.k.a. grey-outs)**

Shiny’s architecture is built on top of websockets, which are long-lived network connections between the browser and an R session on the server. If this connection is broken for any reason, the browser is no longer able to communicate with its R session on the server. Shiny indicates this to the user by turning the page background grey and fading out the page contents.

In Shiny 0.14 and Shiny Server 1.4.7, we’ve done work at both the server and package levels to minimize the number of grey-outs users will see. Simply by upgrading Shiny Server, transient (<15 sec) network interruptions should no longer disrupt Shiny apps. And for many Shiny apps, a secondary, opt-in reconnection mechanism should all but eliminate grey-outs. This article on shiny.rstudio.com has all the details.
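The opt-in mechanism is enabled per app with session$allowReconnect(); a minimal sketch, assuming Shiny 0.14 or later:

```r
library(shiny)

ui <- fluidPage(textOutput("now"))

server <- function(input, output, session) {
  # Opt in to transparent reconnection on servers that support it;
  # pass "force" instead of TRUE to test the behaviour locally.
  session$allowReconnect(TRUE)
  output$now <- renderText(format(Sys.time()))
}

# shinyApp(ui, server)   # run interactively
```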

**Bookmarkable state**

Shiny 0.14 introduced a “bookmarkable state” feature that made it possible to snapshot the state of a running Shiny app and send it to someone as a URL to try in their own browser. At the app author’s option, the app state can either be fully encoded in the URL or written to disk and referred to by a short ID. The latter approach requires support from the server, and that support is now officially provided by Shiny Server and Shiny Server Pro 1.4.7. (This functionality is not yet available for ShinyApps.io, however.)

**Coming soon: Shiny Server 1.5.0**

Just a heads up: Shiny Server (Pro) 1.5.0 is coming in a few weeks. Shiny Server was originally written using Node.js 0.10, which is nearing the end of its lifespan. This release will move to Node.js 6.x.

Due to the complexity of this upgrade, Shiny Server 1.5.0 will not add any new features, except for supporting perfect forward secrecy for SSL/TLS connections. The focus will be entirely on ensuring a smooth and stable release.

To **leave a comment** for the author, please follow the link and comment on their blog: ** RStudio Blog**.

### Optimize Data Exploration With Sapply() – Exercises

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

The apply() functions in R are a utilization of the Split-Apply-Combine strategy for Data Analysis, and are a faster alternative to writing loops.

The sapply() function applies a function to each element of a list or vector (for a dataframe, each of its columns) and simplifies the output.

Structure of the sapply() function: sapply(data, function, ...)

The dataframe used for these exercises:

dataset1 <- data.frame(observationA = 16:8, observationB = c(20:19, 6:12))
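As a warm-up (using max, which is not one of the exercises below), note how sapply applies the function over each column and simplifies the result to a named vector:

```r
# dataset1 as defined above
dataset1 <- data.frame(observationA = 16:8, observationB = c(20:19, 6:12))

sapply(dataset1, max)
## observationA observationB
##           16           20
```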

Answers to the exercises are available here.

**Exercise 1**

Using sapply(), find the length of dataset1's observations:

**Exercise 2**

Using sapply(), find the sums of dataset1's observations:

**Exercise 3**

Use sapply() to find the quantiles of dataset1's columns:

**Exercise 4**

Find the classes of dataset1's columns:

**Exercise 5**

Required function:

DerivativeFunction <- function(x) { log10(x) + 1 }

Apply the “DerivativeFunction” to dataset1, with simplified output:

**Exercise 6**

Script the “DerivativeFunction” within sapply(). The data is dataset1:

**Exercise 7**

Find the range of dataset1:

**Exercise 8**

Print dataset1 with the sapply() function:

**Exercise 9**

Find the mean of dataset1's observations:

**Exercise 10**

Use sapply() to inspect dataset1 for numeric values:

To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.

### Make tilegrams in R with tilegramsR

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

In this busy election season (here in the US, at least), we're seeing a lot of maps. Some states are red, some states are blue. But there's a problem: voters are not evenly distributed throughout the United States. In this map (the fivethirtyeight.com US election forecast on October 13) Montana (MT) is a large state shaded red, but it represents only 3 of the 538 Electoral College votes. In the big scheme of things, the outcome in Montana doesn't have much impact on the election. Contrast that with the much smaller state of New Jersey and its 14 electoral votes: a state so small that its label (NJ) doesn't even fit on the map. A pixel in New Jersey represents almost 80x the voting power of a pixel in Montana, but because of its sheer size Montana dominates the map.

This wouldn't be a problem if all states had an area directly proportional to the number of Electoral College votes, but that's not the case. But we can fix the problem, and make each state represent its voting power proportionately, by instead using a tiled cartogram, or tilegram. FiveThirtyEight helpfully provides a tilegram of its electoral forecasts as well:

This map gives a much better representation of Clinton's (blue) lead in the race over Trump (red), currently standing at 339 to 199 Electoral College votes.

You can make tilegrams in R, thanks to the tilegramsR package by Bhaskar Karambelkar, available on Github. Specifically, tilegramsR provides spatial objects representing the US states scaled by Electoral College votes or population, which you can then use in conjunction with the leaflet package to produce maps (and even add interactivity like pop-up data, if you wish). This RPubs page gives several examples of creating tilegrams, including this map scaled by Electoral College votes.
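A minimal sketch of that workflow (assumes the tilegramsR and leaflet packages are installed; sf_FiveThirtyEightElectoralCollege is one of the package's bundled geometries, and the exact object names may vary by version):

```r
library(tilegramsR)
library(leaflet)

# Render the Electoral College tilegram with a non-geographic (simple) CRS,
# since tile coordinates are abstract rather than latitude/longitude
leaflet(sf_FiveThirtyEightElectoralCollege,
        options = leafletOptions(crs = leafletCRS("L.CRS.Simple"))) %>%
  addPolygons(weight = 1, color = "#444444", fillColor = "#cccccc")
```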

For more on the tilegramsR package, check out its home on Github, linked below.

Github (bhaskarvk): tilegramsR (via FlowingData)

To **leave a comment** for the author, please follow the link and comment on their blog: ** Revolutions**.

### reproducible logo generated by ggtree

(This article was first published on ** R on Guangchuang YU**, and kindly contributed to R-bloggers)

ggtree provides many helper functions for manipulating phylogenetic trees and makes it easy to explore tree structure visually.

Here, as examples, I used ggtree to draw the capital characters G and C, the first letters of my name :-).

To draw a tree in such a shape, we need the fan layout (a circular layout with an open angle) and then to rotate the tree so that the open space sits in the correct position. Here is the source code to produce the G and C tree shapes. I am thinking about using the G-shaped tree as the ggtree logo. Have fun with ggtree!

### Global Temp: Geo-Spatial Records over Time

(This article was first published on ** data_steve**, and kindly contributed to R-bloggers)

Over the past couple of months there have been data visualizations showing global average temperature trends. They have gotten a lot of media attention, so I got to see examples fairly regularly. Several of them (e.g., 1 and 2) used the same HadCRUT global temperature anomalies data, analyzing it with different techniques or technologies to highlight specific statistics, namely the mean. I liked @ucfagls's simulation work on the averaged data and @hrbrmstr's incorporation of the error model's confidence intervals over time.

Most of the posts above emphasize the temporal nature of the data, as it is often the processed, averaged values that they use; these averages aggregate whatever observations are present over the globe at each time interval. After staring at these a while, I began to wonder at the decades-long downturns in the global averages that begin around the 1900s and again in the 1940s. These downturns left me stumped to reconcile with what I knew generally of the industrial economic development going on around the world at those times.

I decided to take a peek at the rawest data I could easily access from public records. Some fun data-play ensued. From that, I'll be sharing a series of posts emphasizing the geo-spatial nature of the temperature records as the collection patterns evolved over time.

The gif above is the output of the first steps I took to process the raw HadCRUT data.

**HadCRUT4 data**

In order to get the rawest data possible for HadCRUT4, you have to go here to get the component datasets it is synthesized from. HadSST3 is the sea-surface data and CRUTEM4 is the land-surface data.

In this post I'll work with the sea-surface data, HadSST3. In later posts I'll work with the land-surface and compiled data as well.

Three steps will be needed to complete the animated gif above:

- get the text data and split it up by its time intervals
- parse and clean the data
- create a fun gif showing recordkeeping over time

The pattern in the data is that the first line has all the meta data, like the month and year the data represents, and the next 36 lines contain the contents of the 36×72 numeric matrix of gridded temperature readings. So I use that pattern to create a regex defining the beginning of each new month's worth of data, and then split the matrices up into a list of matrices.

```r
# meta-data pattern defining where each new matrix begins
ii <- grepl("\\d+\\s+\\d{4}\\s+\\d+\\s+Rows", data)
sum(ii)/12 + 1850  # should add up to 2016+

counter <- 0
index <- rep(NA, length(ii))
for (j in 1:length(data)) {
  if (ii[j]) {
    counter <- counter + 1
    index[j] <- counter
  } else {
    index[j] <- counter
  }
}
dat_list <- split(data, index)
length(dat_list)
all(sapply(dat_list, length) == 37)  # correctly parsed
```

**Parse and Clean**

The main objective of this loop is to extract the year-month data from the meta data and to define the matrix. The matrices are very sparse, especially before 1900, so a simple ifelse will suffice to flag the grid locations that have a measurement. Again, the point is the locations, not whatever values are actually present in the data. (Feel free to use whatever parallelization tool you prefer here.)

Since this is monthly data over many years, I mainly want to show the overall progression over 150+ years. So I'll plot the matrix for every 24 months, starting with the first December in the data, in 1850. I intend to use all the data in the next post, so we'll have a chance to see it all, but I didn't want the gif to be too big.

```r
yrmn_step <- seq(12, 2000, by = 24)  # every other year
yrmn_step
dat_mat <- lapply(dat_list[yrmn_step], function(x) {
  # first line has meta data
  meta <- x[1]
  yr <- regmatches(meta, regexec("\\d{4}", meta))[[1]]
  mo <- regmatches(meta, regexec("\\d{1,2}", meta))[[1]][1]
  mo <- ifelse(as.integer(mo) < 10, paste0("0", mo), mo)
  # remaining 36 lines have the data
  mx <- x[2:37]
  mx <- sapply(mx, function(y) {
    m <- as.numeric(strsplit(trimws(y), "\\s+")[[1]])
    ifelse(m == -99.99, 0, 1)  # replace missing with 0, non-missing with 1
  })
  m <- matrix(mx, 36, 72)
  colnames(m) <- c(paste0(seq(175, 5, by = -5), "W"), 0, paste0(seq(5, 180, by = 5), "E"))
  rownames(m) <- c(paste0(seq(85, 5, by = -5), "S"), "Eq", paste0(seq(5, 90, by = 5), "N"))
  ym <- paste0(yr, "/", mo)
  list(ym, m)
})
```

**Animate gif**

I would recommend sending the output of this task to its own folder to help with the animation step later.

```r
plots <- file.path(getwd(), "plots/had_plots/")
```

Since the data is one big numerical matrix, plotting in base R using image seemed the fastest and most similar to the other graphs you see on government websites. I've included a horizontal dashed line to keep the Equator evident.

```r
# plot every other Dec snapshot
lapply(dat_mat, function(x) {
  png(paste0(plots, substr(x[1], 1, 4), ".png"))
  image(x[2][[1]], col = c("white", "blue"), axes = FALSE,
        main = "Geo-Locate Sea-Surface Temp Readings\nby Year/Month")
  abline(h = .5, lty = "dashed")
  text(y = .98, x = .5, labels = x[1], col = "red", cex = 1.5)
  axis(1, at = c(0, 0.25, 0.5, 0.75, 1),
       labels = c("180W", "90W", "0", "90E", "180E"), srt = 45, tick = FALSE)
  axis(2, at = c(0, 0.5, 1), labels = c("90S", "Eq", "90N"), srt = 45, tick = FALSE)
  dev.off()
})
```

Now to apply the animation. I suppose you could use the animation package, but I just used the command-line tools. I'll show you how. First, change directories to where your new pngs are. You can download ImageMagick, which provides the convert tool, for whatever platform. I found the system function could successfully call convert and get it to work. (On my Mac OS, system2 could handle the pwd call, but not convert.)

```r
# install ImageMagick (which provides the convert tool) for your platform
setwd(plots)
system("pwd")
system("convert -loop 1850 *.png animated.gif")
```

The animated.gif file should then be in the same folder.

In the next post, I’ll return to the monthly data. Using all the data this time, I’ll do some summarizations and plotting of the geo-spatial variation over time.

To **leave a comment** for the author, please follow the link and comment on their blog: **data_steve**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Eight (not 10) things an R user will find frustrating when trying to learn Python

(This article was first published on **Mango Solutions » R Blog**, and kindly contributed to R-bloggers)

When speaking with clients and other R users at events such as LondonR and EARL, I've noticed an increasing trend of people looking to learn some python as the next step in their data science journey. At Mango most of our consultants are pretty happy using either language, but as an R user of 12 or so years I've only ever dabbled with python. Recently, however, I found myself having to learn it quickly, so I thought I'd share some of my observations.

Before you stop reading, I should say that I am fully aware that there are many blog posts covering the high-level pros and cons of each language. For this post I thought I'd get down to the nitty gritty. What does an R user really experience when trying to pick up python? In particular, what does an R user who comes from a statistics background experience?

Personally I found eight (I wanted 10 but python is too good) and here they are:

1. **Lack of Hadley**. So there is a Wes, but there is a lot of duplication in functionality between packages. To start with you import `statistics` and find the `mean` function, only to find it has been re-written for pandas. Later you find that everyone has their own idea on the best way to implement cross-validation. All very confusing when you start out. This brings me on to:
2. **Plotting**. I had heard a lot of good things about matplotlib and seaborn, but ggplot2 is streets ahead (IMHO). I would even go as far as to say that ggplot2 has a shallower learning curve.
3. **IDEs**. Hats off to RStudio for changing the R world when it comes to IDEs. I remember a time before RStudio when the R GUI, StatET and Tinn-R were the norm. How things have improved. Sadly, python is not quite there yet. As an RStudio user I opted for Spyder. It's OK, but the script editor needs some work. The integration in Jupyter Notebook seems much better when I chat with colleagues, but I'm just not a big fan of notebooks.
4. **Namespaces**. I've lost count of the number of times I've told trainees on an intro-to-R course that masking very rarely trips you up as a user (unless you're building packages, it really doesn't). Let's just say that in python you have to be careful. Bring too much in and you'll overwrite your own objects and cause chaos. This means you bring in things as and when you need them. Having to explicitly import OS utilities in order to change the working directory and so on is frustrating. That said, python's capabilities are a little better than R's in this area.
5. **Object Orientation**. I've grown to love R's flexible S3 classes, with lines like:
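(The original snippet didn't survive the formatting, but a minimal sketch of the kind of S3 line meant here might look like the following; the class name `mything` is purely illustrative.)

```r
# S3 in a nutshell: assign a class attribute, then define a method for that class
x <- list(value = 42)
class(x) <- "mything"  # any object becomes a "mything" by simple assignment

print.mything <- function(x, ...) {
  cat("a mything holding", x$value, "\n")
}

print(x)  # dispatches to print.mything
```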

In python I am never quite sure what methods exist for an object and when to just go functional. You also really have to know about classes to work with python effectively, whereas a casual R user can get by without even knowing that R has a class system.

6. **Reliance on R**. On my recent project I was using the best of the statistical capabilities in python. First off, I should say that it's basically all there (except for stepwise GLMs, for some bizarre reason). However, although I've always known that most of the statistical modelling capabilities in python have been ported from R, the documentation is pretty lazy and most of it just points you at the R documentation. The example datasets are even the same! Speaking of the documentation:
7. **Help documentation**. I can only speak for the more popular packages in the two languages, but the R documentation is much more plentiful and generally contains a lot more examples.
8. **Zero-based arrays**. I couldn't write a list without this coming up. I do love it when smug coders who have developed in other languages tell me that R is the exception here by indexing from 1. However, as a human being I count from 1 and this will always make more sense to me. Ending at n-1 is also confusing. Compare:
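(The comparison snippet was also lost in formatting; a reconstruction of the contrast, with the python side shown in comments and the vector `v` purely illustrative, might be:)

```r
v <- c("a", "b", "c")
v[1]           # "a" -- R indexes from 1
v[length(v)]   # "c" -- the last element sits at position length(v)

# the python equivalent indexes from 0, and slices exclude the end point:
#   v = ["a", "b", "c"]
#   v[0]          -> "a"
#   v[len(v) - 1] -> "c"   (or v[-1])
#   v[0:2]        -> ["a", "b"]
```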

What I was impressed by was how extensively the statistical capabilities in R have been ported to python (I wasn't expecting the mixed modelling or survival analysis capabilities to be anything like those in R, for example). However, as an existing R user there really is no point in switching to python for statistics. The only benefit would be if you were using python for, say, extensive web-scraping and you wanted to be consistent. If that's your reason, though, then let me point you towards Chris Musselle's blog post, "Integrating Python and R Part II – Executing R from Python and Vice Versa". And don't forget that you can also just use *rvest*.

So my advice would be: if you're going to try to learn python, don't learn it with the intention of using it to build models. Learn it because it's a more flexible all-round programming language and you have some heavy lifting to do. Just find something that's hard to do in R and try using python for that. Otherwise you'll end up like me, writing a whingy blog post!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Mango Solutions » R Blog**.