# Methodology Blogs

### Blind Alley

Paul Alper points in a comment to an excellent news article by James Glanz and Agustin Armendariz:

Dr. Carlo Croce is among the most prolific scientists in an emerging area of cancer research . . . a member of the National Academy of Sciences, Dr. Croce has parlayed his decades-long pursuit of cancer remedies into a research empire: He has received more than $86 million in federal grants . . .

Over the last several years, Dr. Croce has been fending off a tide of allegations of data falsification and other scientific misconduct, according to federal and state records, whistle-blower complaints and correspondence with scientific journals obtained by The New York Times.

In 2013, an anonymous critic contacted Ohio State and the federal authorities with allegations of falsified data in more than 30 of Dr. Croce’s papers. Since 2014, another critic, David A. Sanders, a virologist who teaches at Purdue University, has made claims of falsified data and plagiarism directly to scientific journals where more than 20 of Dr. Croce’s papers have been published. . . .

From just a handful of notices before 2013 — known as corrections, retractions and editors’ notices — the number has ballooned to at least 20, with at least three more on the way, according to journal editors. Many of the notices involve the improper manipulation of a humble but universal lab technique called western blotting, which measures gene function in a cell and often indicates whether an experiment has succeeded or failed.

Hey—this sounds pretty bad!

Despite the lashing criticisms of his work, Dr. Croce has never been penalized for misconduct, either by federal oversight agencies or by Ohio State, which has cleared him in at least five cases involving his work or the grant money he receives. . . . Now, in the wake of those and other questions from The Times, the university has decided to take a new look to determine whether it handled those cases properly. . . . Whatever the outcome of that review, Mr. Davey said, decisions on research misconduct at Ohio State were based solely on “the facts and the merits of each individual case,” not a researcher’s grant money. Any other suggestion would be “false and offensive,” he said, adding that the university has “spent significantly more to support his research program than he has brought in from outside sources.”

Sunk cost fallacy, anyone?

But let’s hear Croce’s side of the story:

During an interview in October, and in a later statement, Dr. Croce, 72, denied any wrongdoing . . . “It is true that errors sometimes occur in the preparation of figures for publication,” Dr. Croce said in the statement, issued through the Columbus law firm Kegler Brown Hill & Ritter. Any mistakes with figures were “honest errors,” he said, adding that he did not condone plagiarism but that he must rely on co-authors to provide proper attribution.

Also this juicy bit:

Dr. Croce, who has a medical degree but no Ph.D., showed his own willingness to buck scientific consensus when he became an adviser to the Council for Tobacco Research. A federal court later found that the council was central to a conspiracy to deceive the public on the dangers of smoking. Dr. Croce stayed on until the council was disbanded in the industry’s more than $100 billion settlement with tobacco plaintiffs in 1998.

Wow, this guy touches all the bases: National Academy of Sciences, plagiarism, fake data, cigarette funding, . . . I haven’t looked at any of the guy’s papers but I can only assume that, even in the legitimate work, the conclusions are based on p-values obtained via the garden of forking paths and that all his estimates are exaggerated. Just throw in some evolutionary psychology, a collaboration with Dr. Anil Potti, and statistical consulting by Weggy, and we’ve pretty much got it all.

**P.S.** Check out that New York Times article. What are the chances of finding an article whose two authors both have names ending in z? (1/26)^2, that’s p = 0.0015. Highly statistically significant. Someone send this to Susan Fiske so it can be published in PPNAS!

The post Blind Alley appeared first on Statistical Modeling, Causal Inference, and Social Science.

### box plot [xkcd]

### Employee Retention with R Based Data Science Accelerator

*by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)*

Employee retention has been, and will continue to be, one of the biggest challenges facing companies. While classical tactics such as promotions and competitive perks are used to retain employees, it is now a hot trend to rely on machine learning to discover behavioral patterns through which a company can understand its employees better.

Employee demographic data has long been studied and used to analyze employees’ inclination to leave a company. Nowadays, with the proliferation of the Internet, employees’ behavior can be better understood and analyzed through data such as internal and external social media postings. Such data can be leveraged to analyze, for example, sentiment, and thereby determine an employee’s likelihood of leaving the company. Novel cognitive computing technology based on artificial intelligence tools empowers today’s HR departments to identify staff who are likely to churn before they do. Through proactive intervention, HR can better manage staff and encourage them to remain with the company longer term.

This blog post introduces an R based data science accelerator that can be quickly adopted by a data scientist to prototype a solution for the employee attrition prediction scenario. The prediction is based on two types of employee data that are typically already collected by companies:

- Static data, which does not tend to change over time. This type refers to demographic and organizational data such as age, gender, and title. A characteristic of this type of data is that within a certain period it does not change, or changes only in a deterministic way. For instance, an employee’s years of service is static, as the number simply increments every year.
- Dynamic data, the evolving information about an employee. Recent studies reveal that sentiment plays a critical role in employee attrition prediction. Classical measures of sentiment require employee surveys of work satisfaction. Social media posts are useful for sentiment analysis, as employees may express their feelings about work in them. Unstructured data such as text can be collected and mined for patterns indicative of employees with different inclinations to churn.

Attrition prediction is a scenario that takes historic employee data as input in order to identify individuals who are inclined to leave. The basic procedure is to extract features from the available data that might previously have been analyzed manually, and to build predictive models on a training set labeled with employment status. Normally this can be formalized as a supervised classification problem; the distinctive challenge is that the classes (employees who stay versus those who leave) are usually imbalanced. Training on such an imbalanced data set requires resampling or cost-sensitive learning techniques. For sentiment analysis on unstructured data such as text, pre-processing techniques that extract analysis-friendly quantitative features should be applied. Commonly used feature extraction methods for text analysis include word2vec, term frequency, and term frequency-inverse document frequency (TF-IDF). The choice of algorithm for building the model depends on the data characteristics. In case a specific algorithm does not yield the desired results, we have found that ensemble techniques can be deployed to effectively boost model performance.
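As a sketch of the class-imbalance point above (toy data and column names invented here, not taken from the accelerator), minority-class upsampling can be done in a few lines of base R:

```r
# Toy attrition data (invented columns): 90 stayers, 10 leavers.
set.seed(123)
train <- data.frame(
  Attrition      = c(rep("No", 90), rep("Yes", 10)),
  YearsAtCompany = sample(1:20, 100, replace = TRUE)
)

# Upsample the minority class ("Yes") until the classes are balanced.
minority  <- train[train$Attrition == "Yes", ]
majority  <- train[train$Attrition == "No", ]
upsampled <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
balanced  <- rbind(majority, upsampled)

table(balanced$Attrition)
```

Cost-sensitive learning is the alternative: instead of resampling, the classifier is told that misclassifying a leaver costs more than misclassifying a stayer.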

The data science language R is a convenient tool for performing HR churn prediction analysis. A lightweight data science accelerator that demonstrates the process of predicting employee attrition is shared in this Github repository. The walk-through shows cutting-edge machine learning and text mining techniques applied in R.

The code for the analytics is provided in an R markdown document, which can be interactively executed step by step to aid replication and learning. Documents of various formats (e.g., PDF, html, Jupyter Notebook, etc.) can be produced directly from the markdown document.

Taking this employee attrition data set, for example, one can easily visualize and perform correlation analysis with R packages. For instance, we may plot the distribution of monthly income of employees from different departments and intuitively analyze whether income can be a factor that affects attrition.

```r
library(xlsx)
library(ggplot2)

df <- read.xlsx("./attrition.xlsx", sheetIndex = 1, header = TRUE, colClasses = NA)

ggplot(df, aes(x = factor(Department), y = MonthlyIncome, fill = factor(Attrition))) +
  geom_boxplot() +
  xlab("Department") +
  ylab("Monthly income") +
  scale_fill_discrete(guide = guide_legend(title = "Attrition")) +
  theme_bw()
```

R packages such as tm offer handy functions for text mining tasks on employees’ social media posts. Text analysis in multiple languages is often of great interest to employers with subsidiaries in different language-speaking locations. This can be done either via a language-specific text analysis package (e.g., jiebaR for Chinese segmentation) or via online translation. This R code demonstrates how to perform tokenization and term-frequency counting on English-Chinese text with jiebaR and tm.
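The tm and jiebaR packages do the real work in the linked code; as a minimal sketch of the term-frequency idea itself (invented sentences, base R only), tokenization and counting look like:

```r
# Toy corpus of (invented) employee posts.
posts <- c("love the new project team",
           "thinking about leaving soon",
           "love my team")

# Tokenize on whitespace and count raw term frequencies.
tokens <- unlist(strsplit(tolower(posts), "\\s+"))
freq   <- sort(table(tokens), decreasing = TRUE)
freq
```

In a real pipeline these counts would be weighted (e.g., with TF-IDF) and fed to the classifier as features.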

For the complete solution for employee attrition, follow the link below.

Github (Microsoft): Data Science Accelerator - Employee Attrition Prediction with Sentiment Analysis

### Science communication is not a one-shot game

In our recent discussion of Ted doubling down on power pose, commenter Michael raised an interesting question:

I think the general attitude of most people who work on communicating science to the public is that their responsibility is only to make sure that any information they present has a source with the proper credentials (published in a peer-reviewed journal, endorsed by PhD experts in the relevant disciplines at universities). Since they are not themselves PhD experts, the feeling is that “Who am I to challenge this expert? I am just telling you what my expert says, it’s not my job to get involved in these obscure internal arguments”. . . . If Slate can let Andrew Gelman write an article, or Retraction Watch can publish an interview with him expressing his position without publishing comments from experts with objectively equal qualifications who disagree, why can’t TED let Amy Cuddy put out her ideas? How should someone outside of the relevant disciplines be expected to know when what an expert is saying needs to be challenged? I can’t think of a good solution.

I replied as follows:

One difference between Cuddy’s Ted talk and my Slate articles is that I take the other side of the argument seriously, even if I express disagreement.

For example, today in Slate I looked into Jon Krosnick’s claim that the outcome of the 2016 election was determined by Trump being listed first on the ballot in many swing states. I concluded that it was possible but that I was skeptical that the effects would’ve been large. True, Slate did not invite Krosnick to respond. But in my article I linked to Krosnick’s statement, I clearly stated my sources of evidence, I linked and took seriously a research article by Krosnick and others on the topic . . . I did my due diligence.

In contrast, the Ted team avoids linking to criticisms of Cuddy’s work, and I do not consider her statements to be in the full spirit of scientific inquiry. It seems like a damage control operation more than anything else. As to the original Carney, Cuddy, and Yap article: as I noted above, it makes a claim in the abstract that is not supported by anything in the paper. And more recently Carney gave a long list of problems with the paper, which again Cuddy is not seriously addressing.

This response is fine as far as it goes, but I realized something else is going on, which is that Slate and Ted and other media outlets get multiple chances. If Jon Krosnick can make a strong case that my skepticism about his theory is misplaced, he can write about it—and even if Slate doesn’t run Krosnick’s (hypothetical) response right away, they’d certainly be aware of it if they were to do more reporting on ballot-order effects. What about Ted? It’s hard to fault them for greenlighting Cuddy’s talk: at the time, the Carney/Cuddy/Yap paper had not been widely criticized, the failed replications were still in the future, and Cuddy had that Harvard credential. So, fine. My problem with Ted comes later, when they continue to endorse the work—and, more recently, to go all-in on it—in a way that dodges all criticism. I can’t expect typical journalistic outlets to discover flaws in research claims that already managed to get past peer review. But I can fault them for not updating their priors in light of new information.

Journalism, like science and bookmaking, is a multiple-round game, and the big fails can come from not knowing when to cut your losses.

The post Science communication is not a one-shot game appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Dataviz of the week: 8/3/17

Corinne Riddell posted this on Twitter. It’s one version of multiple time series that she tried out, one for each US state. It’s not the finished article, but it is really nice for its combination of that recognisable shape (I suppose if your country has a dull shape like Portugal — no offence — then readers won’t immediately recognise the meaning of the arrangement) and the clean, simple small multiples. Admittedly, the time series has a high enough signal-to-noise ratio to make this possible, and only a few unusual states; without that it might start to look like spaghetti. But it’s always worth sketching out options like this to see how they work.

Could the state names get dropped? Probably not, but they could be turned into two-letter abbreviations. The whole idea of the small multiple is that it’s not a precise x-y mapping but a general impression, so the y-axis labels could go (just as there are no x-axis labels).

Overall, I really like it and would like to write up a UK function for this to add to my dataviz toolbox.

### Infographic tools before the computer

There was a time when charts and infographics were drawn by hand, because computers weren’t affordable or commonplace. John Grimwade provides a rundown of the tools he used back in the day. The thing above was used to draw different-sized circles. On paper. With a pen.

The tools remind me a lot of the ones my dad used to have lying around. He’s a civil engineer and used to pull out all these blueprints with measurements and codes. Now they just have giant, high-resolution computer screens.

I wonder what my kids will think of the tools I use when they’re older. “You mean you have to use your hands to visualize your data? That’s like a baby’s toy.”

**Tags:** hand-drawn, tools

### Follow-up forecasting forum in Eindhoven

(This article was first published on **R – Hyndsight**, and kindly contributed to R-bloggers)

Last October I gave a 3-day masterclass on “Forecasting with R” in Eindhoven, Netherlands. There is a follow-up event planned for Tuesday 18 April 2017. It is particularly designed for people who attended the 3-day class, but if anyone else wants to attend they would be welcome.

Please **register here** if you want to attend.

The preliminary schedule is as follows.

10.00 — 11.00 New developments in forecasting using R

- forecast v8.0
- prophet
- forecastHybrid
- opera

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Hyndsight**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A fistful of Stan case studies: divergences and bias, identifying mixtures, and weakly informative priors

Following on from his talk at StanCon, Michael Betancourt just wrote three Stan case studies, all of which are must reads:

- Diagnosing Biased Inference with Divergences: This case study discusses the subtleties of accurate Markov chain Monte Carlo estimation and how divergences can be used to identify biased estimation in practice.

- Identifying Bayesian Mixture Models: This case study discusses the common pathologies of Bayesian mixture models as well as some strategies for identifying and overcoming them.

- How the Shape of a Weakly Informative Prior Affects Inferences: This case study reviews the basics of weakly-informative priors and how the choice of a specific shape of such a prior affects the resulting posterior distribution.

**Reproducible R scripts**

They all come with fully reproducible knitr scripts to run in RStan. The same lessons hold for the other interfaces, so don’t let the R put you off.

**A spectator sport**

It was particularly fun sitting in the office the day Michael went and figured out all the mixture modeling properties. It followed on from one of our Stan meetings and some of my own failed experiments.

**Publish or perish**

It’s really a shame that this kind of methodological study is so hard to publish, because all three of these deserve to be widely cited. Maybe Andrew has some ideas of how to turn these into “regular” papers. The main thing journal articles give us is a way to argue that we got research results from our research grants. Not a small thing!

**Other case studies**

We have a bunch of case studies up and are always looking for more. The full list and instructions for submitting are on the Stan case studies page.

**Getting information out of the posterior fit object in R**

And in case you didn’t see it, Jonah wrote up a guide to extracting the information you need from a Stan fit object in R.

The post A fistful of Stan case studies: divergences and bias, identifying mixtures, and weakly informative priors appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Building views with R

(This article was first published on **R blog | Quantide - R training & consulting**, and kindly contributed to R-bloggers)

[Here you can see the Building views with R cheat sheet at a full resolution]

Queries

In database theory, a query is a request for data or information from a database table or combination of tables.

Since the advent of dplyr, we have had something in R that conceptually resembles a query quite closely:

```r
require(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
require(pryr)

mtcars %>%
  tbl_df() %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048
```
What I particularly appreciate about dplyr is the possibility of building my *query* as a step-by-step set of R statements that I can progressively test at each step.

Views

Again in database theory, a *view* is the result set of a stored query on the data, which the database users can query just as they would in a table.

I would like to have something similar to a view in R.

As far as I know, I can achieve this goal in three ways:

- Function makeActiveBinding()
- Operator %<a-% from package pryr
- My proposed %>>% operator

Function makeActiveBinding()

Function makeActiveBinding(sym, fun, env) installs a function in an environment env so that getting the value of sym calls fun with no arguments.

As a basic example I can actively bind a function that simulates a dice to an object named dice:

```r
makeActiveBinding("dice", function() sample(1:6, 1), env = globalenv())
```

so that:

```r
replicate(5, dice)
## [1] 5 1 6 2 3
```

Similarly, I can wrap a dplyr expression into a function:

```r
f <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}
```

and then actively bind it to a symbol:

```r
makeActiveBinding("view", f, env = globalenv())
```

so that, any time we call view, the result of function f() is computed again:

```r
view
## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048
```

As a result, if I change any value of mpg within mtcars, view is automatically updated:

```r
mtcars$mpg[c(1,3,5)] <- 0
view
## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 24.59091 9.231192
## 2     6 16.74286 7.504189
## 3     8 13.76429 4.601606
```

Clearly, I have to admit that all of this looks quite unfriendly, at least to me.

Operator %<a-%

A valid alternative, that wraps away the complexity of function makeActiveBinding() is provided by operator %<a-% from package pryr:

```r
view %<a-% {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}
```

Again, if I change any value of mpg within mtcars, the value of view gets automatically updated:

```r
mtcars$mpg[c(1,3,5)] <- 50
view
## # A tibble: 3 × 3
##     cyl mean_mpg    sd_mpg
##   <dbl>    <dbl>     <dbl>
## 1     4 29.13636  8.159568
## 2     6 23.88571 11.593451
## 3     8 17.33571  9.688503
```

Note that in this case I have to enclose the whole expression within curly brackets.

Moreover, the final assignment operator %<a-% goes on the left-hand side of my chain of dplyr statements.

Operator %>>%

Finally, I would like to propose a third alternative, still based on makeActiveBinding(), which I have named %>>%:

```r
`%>>%` <- function(expr, x) {
  x <- substitute(x)
  call <- match.call()[-1]
  fun <- function() {NULL}
  body(fun) <- call$expr
  makeActiveBinding(sym = deparse(x), fun = fun, env = parent.frame())
  invisible(NULL)
}
```

that can be used as:

```r
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) %>>%
  view
```

And again, if I change the values of mpg:

```r
mtcars$mpg[c(1,3,5)] <- 100
```

the content of view changes accordingly:

```r
view
## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 33.68182 22.41624
## 2     6 31.02857 30.44321
## 3     8 20.90714 22.88454
```

I believe this operator offers two advantages:

- It avoids the use of curly brackets around my dplyr expression
- It allows me to actively assign the result of my chain of dplyr statements in a more *natural way*, at the end of the chain

The post Building views with R appeared first on Quantide – R training & consulting.


### In case you missed it: February 2017 roundup

(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from February of particular interest to R users.

Public policy researchers use R to predict neighbourhoods in US cities subject to gentrification.

The ggraph package provides a grammar-of-graphics framework for visualizing directed and undirected graphs.

Facebook has open-sourced the "prophet" package they use for forecasting time series at scale.

A preview of features coming soon to R Tools for Visual Studio 1.0.

On the differences between using Excel and R for data analysis.

A data scientist suggests a "Gloom Index" for identifying the most depressing songs by the band Radiohead.

Catterplots is a package that replaces scatterplot points with cats. (Tufte would not approve.)

A collection of tips on using Microsoft R Server from the support team.

A summary of the many improvements slated for R 3.4.0.

R code using the RevoScaleR package to classify a large database of galaxy images in SQL Server.

A review of four deep learning packages for R: MXNet, darch, deepnet and h2o.

An update on more than a dozen projects and community initiatives funded by R Consortium grants.

R has overtaken SAS for Statistics job listings on indeed.com.

ModernDive is a free online textbook on Statistics and data science using R.

A solution (with R code) for modeling customer churn in the retail industry using SQL Server R Services.

The superheat package provides enhanced heatmap graphics for R.

The fst package provides a new serialization format for R data focused on performance.

Thomas Dinsmore reflects on major events in the R Project and for Microsoft in 2016.

And some general interest stories (not necessarily related to R): A big drain; 'Vous' vs 'tu'; a remembrance of the late Hans Rosling; and Ten Meter Tower, a short film.

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.



### web [mis-]content

For the past two weeks, I have noticed a Web Content process on my computer that eats a lot of my CPU! And I have not found any definitive solution when looking on Linux/Ubuntu forums… Moving to a new [and neat] laptop [as the older one broke a hinge that could not be fixed!] did not help, while killing the process itself saw this very tab vanish in Firefox and the process reappear a few seconds later. Solving this issue seems beyond my reach!

Filed under: Statistics Tagged: laptop, Linux, Ubuntu 16.04, Web Content

### sigr: Simple Significance Reporting

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.

Model Example

Let’s take as our example the following linear relation between x and y:

```r
library('sigr')
set.seed(353525)
d <- data.frame(x = rnorm(5))
d$y <- 2*d$x + 0.5*rnorm(nrow(d))
```

stats::lm() has among the most complete summaries of all models in R, so we can easily see the quality of fit and its significance:

```r
model <- lm(y~x, d=d)
summary(model)
##
## Call:
## lm(formula = y ~ x, data = d)
##
## Residuals:
##       1       2       3       4       5
##  0.3292  0.3421  0.0344 -0.1671 -0.5387
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.4201     0.2933   1.432  0.24747
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871
```

sigr::wrapFTest() can render the relevant model quality summary.

```r
cat(render(wrapFTest(model), pLargeCutoff=1))
```

**F Test** summary: (*R2*=0.94, *F*(1,3)=50, *p*=0.0059).

Note: sigr reports the unadjusted R-squared, as this is the one for which significance is reported by summary.lm().

sigr also carries around the important summary components for use in code.

```r
unclass(wrapFTest(model))
## $test
## [1] "F Test"
##
## $numdf
## [1] 1
##
## $dendf
## [1] 3
##
## $FValue
## [1] 49.66886
##
## $R2
## [1] 0.9430403
##
## $pValue
## [1] 0.005871236
```

In this function sigr is much like broom::glance() or modelr::rsquare().

```r
broom::glance(model)
##   r.squared adj.r.squared     sigma statistic     p.value df    logLik
## 1 0.9430403     0.9240538 0.4261232  49.66886 0.005871236  2 -1.552495
##       AIC      BIC deviance df.residual
## 1 9.10499 7.933304 0.544743           3

modelr::rsquare(model, d)
## [1] 0.9430403
```

This is something like what is reported by caret::train() (where caret controls the model training process).

```r
cr <- caret::train(y~x, d, method = 'lm')
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

cr$results
##   intercept      RMSE  Rsquared    RMSESD   RsquaredSD
## 1      TRUE 0.9631662 0.9998162 0.6239989 0.0007577834

summary(cr$finalModel)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
##      X1      X2      X3      X4      X5
##  0.3292  0.3421  0.0344 -0.1671 -0.5387
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.4201     0.2933   1.432  0.24747
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871
```

(I presume cr$results$Rsquared is a model quality report and not a consistency of cross-validation procedure report. If it is a model quality report it is somehow inflated, perhaps by the resampling procedure. So I apologize for using either caret::train() or its results incorrectly.)

Data example

The issues in taking summary statistics (and significances) from models include:

- Working only from models limits you to models that include the statistic you want.
- Working only from models is mostly "in-sample." You need additional procedures for test or hold-out data.

With sigr it is also easy to reconstruct quality and significance from the predictions, no matter where they came from (without needing the model data structures).

**In-sample example**

```r
d$pred <- predict(model, newdata = d)
cat(render(wrapFTest(d, 'pred', 'y'), pLargeCutoff = 1))
```

**F Test** summary: (*R2*=0.94, *F*(1,3)=50, *p*=0.0059).

Notice we reconstruct the summary statistic and significance, independent of the model data structures. This means the test is generic and can be used on any regression (modulo informing the significance model of the appropriate number of parameters). It also can be used on held-out or test data.

In this mode it is a lot like ModelMetrics::rmse() or caret::postResample().

```r
ModelMetrics::rmse(d$y, d$pred)
## [1] 0.3300736

caret::postResample(d$y, d$pred)
##      RMSE  Rsquared
## 0.3300736 0.9430403
```

**Out-of-sample example**

If we had more data we could look at the quality of the prediction on that data (though you have to take care in understanding that the number of degrees of freedom is different for held-out data).

```r
d2 <- data.frame(x = rnorm(5))
d2$y <- 2*d2$x + 0.5*rnorm(nrow(d2))
d2$pred <- predict(model, newdata = d2)
cat(render(wrapFTest(d2, 'pred', 'y'), pLargeCutoff = 1))
```

**F Test** summary: (*R2*=0.76, *F*(1,3)=9.6, *p*=0.054).

We could have used glmnet instead of lm:

```r
library("glmnet")
d$one <- 1  # make sure we have at least 2 columns
xmat <- as.matrix(d[, c('one', 'x')])
glmnetModel <- cv.glmnet(xmat, d$y)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold

summary(glmnetModel)
##            Length Class  Mode
## lambda     53     -none- numeric
## cvm        53     -none- numeric
## cvsd       53     -none- numeric
## cvup       53     -none- numeric
## cvlo       53     -none- numeric
## nzero      53     -none- numeric
## name        1     -none- character
## glmnet.fit 12     elnet  list
## lambda.min  1     -none- numeric
## lambda.1se  1     -none- numeric

d$predg <- as.numeric(predict(glmnetModel, newx = xmat))
cat(render(wrapFTest(d, 'predg', 'y'), pLargeCutoff = 1))
```

**F Test** summary: (*R2*=0.91, *F*(1,3)=30, *p*=0.012).

Because sigr can render to LaTeX, it can (when used in conjunction with latex2exp) also produce formatted titles for plots.

```r
library("ggplot2")
library("latex2exp")
f <- paste0(format(model$coefficients['x'], digits = 3),
            '*x + ',
            format(model$coefficients['(Intercept)'], digits = 3))
title <- paste0("linear y ~ ", f, " relation")
subtitle <- latex2exp::TeX(render(wrapFTest(d, 'pred', 'y'), format = 'latex'))
ggplot(data = d, mapping = aes(x = pred, y = y)) +
  geom_point() +
  geom_abline(color = 'blue') +
  xlab(f) +
  ggtitle(title, subtitle = subtitle)
```

**Conclusion**

sigr computes a few statistics or summaries (with standard significance estimates) and returns the calculation in both machine-readable and formatted forms. For non-standard summaries, sigr includes some resampling methods for significance estimation. For formatting, sigr tries to stay near the formats specified by the American Psychological Association. sigr works with models (such as lm and glm(family='binomial')) or directly with data, and is a small package with few dependencies.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Win-Vector Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R 3.3.3 is released!

(This article was first published on ** R – R-statistics blog**, and kindly contributed to R-bloggers)

R 3.3.3 (codename “Another Canoe”) was released yesterday. You can get the latest binary versions **from here** (or the .tar.gz **source** code from here). The full list of bug fixes and new features is provided below.

A quick summary by David Smith:

R 3.3.3 fixes an issue related to attempting to use download.file on sites that automatically redirect from http to https: now, R will re-attempt to download the secure link rather than failing. Other fixes include support for long vectors in the vapply function, the ability to use pmax (and pmin) on ordered factors, improved accuracy for qbeta for some extreme cases, corrected spelling for “Cincinnati” in the precip data set, and a few other minor issues.

**Upgrading to R 3.3.3 on Windows**

If you are using **Windows**, you can easily upgrade to the latest version of R using the installr package. Simply run the following code in Rgui:
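A minimal sketch of that code (assuming the standard installr workflow; updateR() is interactive and will prompt before installing):

```r
# Install and load installr, then launch the R upgrade helper.
install.packages("installr")  # one-time installation from CRAN
library(installr)
updateR()  # checks CRAN for a newer R version, then downloads and installs it
```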

Running “updateR()” will detect whether a new R version is available and, if so, download and install it. There is also a step-by-step tutorial (with screenshots) on how to upgrade R on Windows using the *installr* package. If you only see the option to upgrade to an older version of R, change your mirror or try again in a few hours (it usually takes around 24 hours for all CRAN mirrors to get the latest version of R).

*I try to keep the installr package updated and useful, so if you have any suggestions or remarks on the package, you are invited to open an issue on the GitHub page.*

- Changes when redirection of a http:// URL to a https:// URL is encountered:
- The internal methods of download.file() and url() now report that they cannot follow this (rather than failing silently).
- (Unix-alike) download.file(method = "auto") (the default) re-tries with method = "libcurl".
- (Unix-alike) url(method = "default") with an explicit open argument re-tries with method = "libcurl". This covers many of the usages, e.g. readLines() with a URL argument.

- The configure check for the zlib version is now robust to versions longer than 5 characters, including 1.2.11.

- Environmental variable _R_CHECK_TESTS_NLINES_ controls how R CMD check reports failing tests (see §8 of the ‘R Internals’ manual).

- (C-level Native routine registration.) The undocumented styles field of the components of R_CMethodDef and R_FortranMethodDef is deprecated.

- vapply(x, *) now works with long vectors x. (PR#17174)
- isS3method("is.na.data.frame") and similar are correct now. (PR#17171)
- grepRaw(<long>, <short>, fixed = TRUE) now works, thanks to a patch by Mikko Korpela. (PR#17132)
- Package installation into a library where the package exists *via* symbolic link now should work wherever Sys.readlink() works, resolving PR#16725.
- "Cincinnati" was missing an "n" in the precip dataset.
- Fix buffer overflow vulnerability in pdf() when loading an encoding file. Reported by Talos (TALOS-2016-0227).
- getDLLRegisteredRoutines() now produces its warning correctly when multiple DLLs match, thanks to Matt Dowle’s PR#17184.
- Sys.timezone() now returns non-NA also on platforms such as Ubuntu 14.04.5 LTS, thanks to Mikko Korpela’s PR#17186.
- format(x) for an illegal "POSIXlt" object x no longer segfaults.
- methods(f) now also works for f "(" or "{".
- (Windows only) dir.create() did not check the length of the path to create, and so could overflow a buffer and crash **R**. (PR#17206)
- On some systems, very small hexadecimal numbers in hex notation would underflow to zero. (PR#17199)
- pmin() and pmax() now work again for ordered factors and 0-length S3 classed objects, thanks to Suharto Anggono’s PR#17195 and PR#17200.
- bug.report() did not do any validity checking on a package’s BugReports field. It now ignores an empty field, removes leading whitespace and only attempts to open http:// and https:// URLs, falling back to emailing the maintainer.
- Bandwidth selectors bw.ucv() and bw.SJ() gave incorrect answers or incorrectly reported an error (because of integer overflow) for inputs longer than 46341; similarly for bw.bcv() at length 5793. Another possible integer overflow is checked and may result in an error report (rather than an incorrect result) for much longer inputs (millions for a smooth distribution).

- findMethod() failed if the active signature had expanded beyond what a particular package used. (Example with packages XR and XRJulia on CRAN.)
- qbeta() underflowed too early in some very asymmetric cases. (PR#17178)
- R CMD Rd2pdf had problems with packages with non-ASCII titles in ‘.Rd’ files (usually the titles were omitted).


### The Ramones. Punk is Data, Too

(This article was first published on ** English – R-blog**, and kindly contributed to R-bloggers)

**The starting point of this post is a simple question**: can we use *R* to analyze punk bands? And, as a result: what can we learn from applying *data analytics methods* to punk music?

Whether we like it or not, “punk rock is arguably the most important subgenre of music to come out of the ‘70s” and consequently still an integral part of our mainstream contemporary culture. After years of being declared too outrageous to be accepted, its legacy is so astonishingly extensive that it deserves careful consideration and serious attention. For decades, many music critics, fine arts experts, social and political scientists and historians of pop culture have devoted time and energy to studying the punk scene, its cultural production and legacy, the attitude of the punk generation, its tangle of ideologies, and the ways it was perceived and received. Facts and figures, however, are still missing, perhaps because there apparently is nothing more distant from *data analytics* than punk music. So, is *data analytics* of punk rock possible? Would it make any sense? My answer is a loud and bold **yes** – yes, statistics on punk rock matter.

Although the punk scene cannot be condensed into a single band, the *Ramones* are still considered by many as the first “pure punk band” and, perhaps more importantly, one of the most influential. This does not imply that other punk rock bands (Clash, Dead Kennedys, The Stooges, Misfits, Sex Pistols, Social Distortion, Patti Smith Group, etc.) are less noteworthy or not as good. Yet, since I need to start somewhere, I decided that my first attempt would focus on the Ramones – which I paradoxically like a lot despite being more of a baroque and classical music person.

**What did the Ramones do?**

From 1976 to 1994, the Ramones released 14 studio albums. In their original USA release, the albums comprised 176 different songs in total that were quite short (median: 2M 32S) and mostly written in a *Major* key (only 2 songs are in a *minor* key: Em).

Musical purists always reproached the Ramones for knowing only a couple of chords and making excessive use of them. Data show that the band knew at least… 11 different chords (out of too-many-to-bother-counting possibilities), although 80% of their songs were built on no more than 6. And there is no evidence of increasing sophistication in the Ramones’ compositions over time.

Likewise, the number of different chords in a Ramones song is independent of the songwriter(s) – a t-test of *number of different chords ~ writers* does not allow us to reject the null hypothesis of no difference – even though each band member had a very distinct personality, according to the biographers.

In terms of *official charts* ranking in the USA, the success of the Ramones fluctuated over their career. The first years of the band were definitely the most successful, from the creation of the band until the early 80’s. Then, from 1985 onwards, it looks like the sales didn’t follow the strengthening of their reputation, not only within but also outside the punk rock scene.

**What did the Ramones say?**

In my dataset, the Ramones’ lyrics come from azlyrics.com. I preferred this source over many other available sources since that website provides the lyrics without the verse repeats, which, in my opinion, would over-emphasise and, ultimately, bias the relevance of n-grams or topics. The dataset (a data frame) contains a *lyrics* variable, i.e. a character string of the track (without the verse repeats) including the *< br>* tags to mark the end of each line.

An example of the *lyrics* variable is like the following:

Hey ho, let s go < br>Hey ho, let s go < br>They re forming in a straight line < br>They re going through a tight wind < br>The kids are losing their minds < br>The Blitzkrieg Bop < br>They re piling in the back seat < br>They re generating steam heat < br>Pulsating to the back beat < br>The Blitzkrieg Bop. < br>Hey ho, let s go < br>Shoot em in the back now < br>What they want, I dont know < br>They re all reved up and ready to go

Tidying the text up (adopting the data principles recommended by Hadley Wickham) is the necessary first step of the lyrics mining exercise. For that, I follow the *tidy text* approach developed by Julia Silge & David Robinson.
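As a rough sketch of that first step (the input data frame and its column names here are my own illustrative stand-ins, not the post's actual dataset):

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per track, raw lyrics as a single string.
ramones <- data.frame(
  album  = c("Ramones", "Ramones"),
  track  = c("Blitzkrieg Bop", "Beat on the Brat"),
  lyrics = c("Hey ho let s go they re forming in a straight line",
             "Beat on the brat with a baseball bat oh yeah"),
  stringsAsFactors = FALSE
)

# One-token-per-row "tidy" table, then raw uniGram counts.
unigrams_raw <- ramones %>%
  unnest_tokens(word, lyrics) %>%
  count(word, sort = TRUE)

# Cleaned uniGrams: drop common stop words before counting.
unigrams_clean <- ramones %>%
  unnest_tokens(word, lyrics) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```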

First and foremost, it is worth noting that whatever the Ramones say, they say it in very few words! Ramones songs are brief in time, but also short in lyrics (though not so much in vocabulary, with 2,139 unique words in total).

Whereas uniGrams are usually considered suitable for analysis after expurgation of *stop words*, in the Ramones lyrics the raw uniGrams show an interesting pattern. The 2 most frequent words in the 14 studio albums are **i** and **you**. One could provocatively argue that *Tea for Two*, a well-known 1925 song from Vincent Youmans and Irving Caesar, is a good representation of the Ramones musical universe that seems to be mainly centered on *you* and *i*, and *i* and *you* !

In the uniGrams table below, the columns of the cleaned uniGrams highlight that the top word in the Ramones lyrics is *dont*, expressing an atmosphere of clear *negation*. But there is also a fascinating tension pointing to the future that shows through words such as *wanna*, *gonna* and *ll* (*will* or *shall*). *Rock* and *punk* amongst the top 20 words definitely remind you what type of music you are listening to, but also what subculture the band belongs to. In an all-men band, words such as *baby*, *love* and *girl* attest to the significance of man-woman relationships in the Ramones songs. Perhaps it took a statistical analysis of lyrics to risk forming the hypothesis of the Ramones as a romantic band…

The identification of **most frequent uniGrams per album** is a further step into a more granular analysis:

In addition to identifying the most frequent single words, we could also highlight *when* they are used in the discography using a simple *Token Distribution Analysis*. Let’s limit this exercise to 5 words only from the list of the top 20: *love*, *gonna*, *rock* (or *rocker*), *life* and *dont*.

A quick visualisation of ‘raw’ nGrams (*stop words* not removed) confirms the feeling of a narrative universe mainly focused on **i**, **you** and **negation** (*don’t*).

**What did the Ramones feel?**

As a (brief) final chapter of this post, I would like to run a very quick – and limited – *sentiment analysis* of the Ramones’ studio album lyrics. Actually, rather than a full *sentiment analysis*, this is nothing but scratching its surface. The *bing* sentiment lexicon was used here, but a similar analysis could be carried out using the *afinn* or *nrc* lexicons (all available in the *tidytext* R package), or using all of them for a comparative approach.
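A minimal sketch of such bing-based scoring (the per-word table below is a made-up illustration, not the actual Ramones data):

```r
library(dplyr)
library(tidytext)

# Hypothetical per-word tidy table: one row per (album, word) occurrence.
words <- data.frame(
  album = c("Ramones", "Ramones", "Leave Home", "Leave Home"),
  word  = c("love", "bad", "happy", "glad"),
  stringsAsFactors = FALSE
)

# Attach bing polarity labels and count positive/negative words per album.
sentiment_by_album <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(album, sentiment)
sentiment_by_album
```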

Although the *sentiment lexicon* gives the word *punk* a negative value, there is little risk in asserting that this is not the way the Ramones intended it.

In order to both fine tune and expand the approach, a more accurate *sentiment analysis* could be undertaken paying attention to 5 additional *tasks* at least:

- in the lyrics, identify the sentiment words preceded or followed by *not*;
- review and, perhaps, amend the sentiment lexicon(s) to better reflect the *punk rock* subculture;
- focus on relative more than absolute frequencies of words;
- add terms’ *inverse document frequency* analysis to measure the impact of the words that are rarely used;
- use ML to spot/predict combinations of n-Grams, topics and writers that would “guarantee” a better ranking in the charts.

The **dataset** and **complete R code** of this post can be downloaded from this link.


### Missing values imputation with missMDA

(This article was first published on ** François Husson**, and kindly contributed to R-bloggers)

“*The best thing to do with missing values is not to have any*” (Gertrude Mary Cox)

Unfortunately, missing values are ubiquitous and occur for plenty of reasons. One solution is **single imputation** which consists in replacing missing entries with plausible values. It leads to a complete dataset that can be analyzed by any statistical methods.

Based on dimensionality reduction methods, the **missMDA package** successfully imputes large and complex datasets with **quantitative** variables, **categorical** variables and **mixed** variables. Indeed, it imputes data with principal component methods that take into account the similarities between observations and the relationships between variables. It has proven to be very competitive in terms of quality of prediction compared to state-of-the-art methods.

With 3 lines of code, we impute the dataset *orange *available in missMDA:

```r
library(missMDA)
data(orange)
nbdim <- estim_ncpPCA(orange)  # estimate the number of dimensions to impute with
res.comp <- imputePCA(orange, ncp = nbdim$ncp)  # estim_ncpPCA returns a list; use its ncp component
```

In the same way, **imputeMCA** imputes datasets with categorical variables and **imputeFAMD** imputes mixed datasets.
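For example, the categorical case follows the same pattern; a sketch using the vnf categorical dataset shipped with missMDA (assuming the standard estim_ncpMCA/imputeMCA pair):

```r
library(missMDA)

# Categorical example: same workflow as imputePCA, using MCA instead of PCA.
data(vnf)                             # categorical survey data with missing entries
nb <- estim_ncpMCA(vnf, ncp.max = 5)  # cross-validated choice of dimensions
res <- imputeMCA(vnf, ncp = nb$ncp)   # res$completeObs is the imputed dataset
```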

With a completed dataset, we can pursue our analyses… however, we need to be careful and not forget that the data was incomplete! In a future post, we will see how to visualize the uncertainty on these predicted values.

You can find more information in this JSS paper, on this website, and in this tutorial given at useR!2016 at Stanford.

You can also watch this playlist on YouTube to practice with R.

You can also enroll in this MOOC.


### How to do a descriptive analysis using regression modeling?

Freddy Garcia writes:

I read your post Vine regression?, and your phrase “I love descriptive data analysis!” made me wonder: how does one do a descriptive analysis using regression models? Maybe my question could be misleading to a statistician, but I am an economics student, so we are accustomed to thinking in causal terms when we set up a regression model.

My reply: This is a funny question because I think of regression as a predictive tool that will only give causal inferences under strong assumptions. From a descriptive standpoint, regression is an estimate of the conditional distribution of the outcome, y, given the input variables, x. This can be seen most clearly, perhaps, with nonparametric methods such as Bart which operate as a black box: Give Bart the data and some assumptions, run it, and it produces a fitted model, where if you give it new values of x, it will give you probabilistic predictions of y. You can use this for individual data, or you can characterize your entire population in terms of x’s and then do Mister P. (That is, you can poststratify; here Bart is playing the role of the multilevel model.) It’s all descriptive.

I train my students to summarize regression fits using descriptive terminology. So, don’t say “if you increase x_1 by 1 with all the other x’s held constant, then E(y) will change by 0.3.” Instead say, “Comparing two people that differ by 1 in x_1 and who are identical in all the other x’s, you’d predict y to differ by 0.3, on average.” It’s a mouthful but it’s good practice, cos that’s what the regression actually says. I’ve pretty much trained myself to talk that way.
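That descriptive reading can be made concrete with a tiny base-R sketch (simulated data; the variable names are mine):

```r
set.seed(1)

# Simulated data: y depends on two inputs x1 and x2.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.3 * x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Descriptive reading of the x1 coefficient: compare predictions for two
# people who differ by 1 in x1 and are identical in x2.
p_lo <- predict(fit, newdata = data.frame(x1 = 0, x2 = 0))
p_hi <- predict(fit, newdata = data.frame(x1 = 1, x2 = 0))
diff_pred <- unname(p_hi - p_lo)  # equals the fitted coefficient on x1
```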

Of course, as Jennifer says, there’s a reason why people use causal language to describe regressions: causal inference is typically what people want. And that’s fine, but then you have to be more careful, you gotta worry about identification, you should control for as many pre-treatment variables as possible but not any post-treatment variables, you should consider treatment interactions, etc.

The post How to do a descriptive analysis using regression modeling? appeared first on Statistical Modeling, Causal Inference, and Social Science.

### Announcing Icarus v0.3

(This article was first published on ** NC233 » R**, and kindly contributed to R-bloggers)

This weekend I released version 0.3.0 of the Icarus package to CRAN.

Icarus provides tools to help perform calibration on margins, which is a very important method in survey sampling. One of these days I’ll write a blog post explaining calibration on margins! In the meantime, if you want to learn more, you can read our course on calibration (in French) or the original paper of Deville and Sarndal (1992). In short, calibration computes new sampling weights so that the sample estimates match totals we already know thanks to another source (typically a census).
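As a rough base-R illustration of the idea (not the Icarus API; just hand-rolled ratio adjustment for a single known margin):

```r
# Toy sample: design weights and a 0/1 variable (e.g. "is female").
w      <- c(10, 10, 12, 8, 10)  # initial sampling weights
female <- c(1, 0, 1, 1, 0)

# Known population totals for the margin, e.g. from a census.
total_female <- 32
total_pop    <- 50

# Simple post-stratification: scale weights within each group so the
# weighted totals match the known margins exactly.
g <- ifelse(female == 1, "f", "m")
target  <- c(f = total_female, m = total_pop - total_female)
current <- tapply(w, g, sum)
w_cal   <- w * (target[g] / current[g])

sum(w_cal[female == 1])  # now matches total_female
sum(w_cal)               # now matches total_pop
```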

In industry, one of the most widely used tools for performing calibration on margins is the SAS macro Calmar, developed at INSEE. Icarus is designed with the typical Calmar user in mind, should they wish to find a direct equivalent in R. The format Icarus expects for the margins and the variables is directly inspired by Calmar’s (wiki and example here). Icarus also provides the same kind of graphs and statistics aimed at helping statisticians understand the quality of their data and estimates (especially on domains), and in general to understand and explain the reweighting process.

Example of Icarus in RStudio

I hope to soon find the time to finish a full, well-documented article to submit to a journal and set it as a vignette on CRAN. For now, here are the slides (in French, again) I presented at the “colloque francophone sondages” in Gatineau last October: http://nc233.com/icarus.

Kudos to the CRAN team for their amazing work!


### Legal Drinking Age Around the World

As you probably know, different countries have different legal age limits for drinking alcoholic beverages. In the United States, the age is 21. In some places in the world, there is no set age. In most places, the legal age is 18 to drink a non-spirit beverage such as beer in a public place without a guardian.

The map above, based on data from Wikipedia, shows where in the world you’re legally allowed to drink a beer in a public place. It’s slightly generalized and doesn’t take into account that in some places you have to be older to purchase the beverage, but it gives you a good idea of the age limits globally.

This, by the way, is part of a new category I’m calling my sketchbook. I need a place where I can mess around with different formats without worrying about the “right” way to do it.

Relevant tutorials: Choropleth Maps and Shapefiles in R / How to Make an Animated Growth Map in R

### Ensemble Learning in R

Previous research in data mining has devised numerous different algorithms for learning tasks. While an individual algorithm might already work decently, one can usually obtain better predictive performance by combining several. This approach is referred to as ensemble learning.

Common examples include random forests, boosting and AdaBoost in particular.

Our slide deck is positioned at the intersection of teaching the basic idea of ensemble learning and providing practical insights in R.

Therefore, each algorithm comes with an easy-to-understand explanation on how to use it in R.
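For instance, a minimal random-forest sketch in R (my own illustration, not taken from the slide deck; assuming the randomForest package):

```r
library(randomForest)

# A random forest is an ensemble of decorrelated decision trees:
# each tree is grown on a bootstrap resample of the data, and
# predictions are combined by majority vote (classification)
# or averaging (regression).
set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

fit                       # out-of-bag error estimate and confusion matrix
predict(fit, head(iris))  # predicted classes for the first few rows
```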

We hope that the slide deck enables practitioners to quickly adopt ensemble learning for their applications in R. Moreover, the materials might lay the groundwork for courses on data mining and machine learning.

Download the slides here

Download the exercise sheet here

*The content was republished on r-bloggers.com with permission.*