The Latest from Blogs and Journals

Browse recent content from academic blogs and journals

2017-05-15
Methodology Blogs

[cat picture]

In a news article entitled, “No, Wearing Red Doesn’t Make You Hotter,” Dalmeet Singh Chawla recounts the story of yet another Psychological Science / PPNAS-style study (this one actually appeared back in 2008 in Journal of Personality and Social Psychology, the same prestigious journal which published Daryl Bem’s ESP study a couple years later).

Chawla’s article is just fine, and I think these non-replications should continue to get press, as much press as the original flawed studies.

I have just two problems. The first is when Chawla writes:

The issues at hand seem to be the same ones surfacing again and again in the replication crisis—too much weight given to small samples, a tendency to publish positive results and not negative results, and perhaps an unconscious bias from the researchers themselves.

I mean, sure, yeah, I agree with the above paragraph. But there are deeper problems going on. First, any effects being studied are small and highly variable: there are some settings where red will do the trick, and other settings where red will make you less appealing. Color and attractiveness are context-dependent, and it’s just inherently more difficult to study phenomena that are highly variable. Second, the experiment in question used a between-person design, thus making things even noisier (see here for more on this topic). Third, the treatment itself was minimal, of the “priming” variety: the color of a background of a photo that was seen for five seconds. It’s hard enough to appear attractive to someone in real life: we can put huge amounts of effort into the task, and so it’s a bit of a stretch to think that this sort of five-second intervention could do much of anything.

Put it together, and you’re studying a highly variable phenomenon using a minimal treatment, using a statistically inefficient design. The study is dead on arrival. Sure, small samples, the garden of forking paths, and the publication process make it all worse, but there’s no getting around the kangaroo problem. Increase your sample and publish everything, and you still won’t be doing useful science; you’ll just be publishing piles of noise. Better than what was...

R bloggers
2017-05-15
Methodology Blogs

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In this article we will show how to run a three-way analysis of variance when both the third-order interaction effect and the second-order interaction effects are statistically significant. This type of analysis can become pretty tedious, especially when our factors have many levels, so we will try to explain it here as clearly as possible. (If you want to watch me doing these analyses live, get my free course on statistical analysis with R here.)

First of all, let’s present the fictitious data we are going to work with. Let’s suppose that a pharmaceutical company is planning to launch a new vitamin that allegedly improves employees’ resistance to effort. The vitamin is tested on a sample of 720 employees, divided into three groups: employees who take a placebo (the control group), employees who take the vitamin in a low dose, and employees who take the vitamin in a high dose. Half of the employees are male and half are female. Moreover, we have both blue-collar and white-collar employees in our sample.

Resistance to effort is measured on an arbitrary scale from 1 to 30 (30 being the highest resistance). Our goal is to determine whether effort resistance is influenced by three factors: dose of vitamin (placebo, low dose, and high dose), gender (male, female), and type of employee (blue collar, white collar). You can find the experiment data in CSV format here.
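As a quick sanity check, the linked data can be loaded and the 3 x 2 x 2 layout inspected roughly as follows (my own sketch; the file name vitamin.csv and the balanced cell sizes are assumptions, not stated in the post):

vitamin <- read.csv("vitamin.csv")   # assumed file name for the linked CSV
str(vitamin)                         # expect columns: effort, dose, gender, type
table(vitamin$dose, vitamin$gender, vitamin$type)   # cell counts across the 3 x 2 x 2 design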

Third-order interaction effect

First of all, let’s check whether the third-order interaction effect is significant. We are going to run the analysis using the aov function in the stats package (our data frame is called vitamin).

aov1 <- aov(effort ~ dose * gender * type, data = vitamin)
summary(aov1)

In the formula above the third-order interaction term is, of course, dose:gender:type. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).

                  Df Sum Sq Mean Sq F value   Pr(>F)
dose:gender:type   2    187    93.4  22.367 3.81e-10

The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-...
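As a hedged sketch (my own code, not necessarily the exact follow-up the author performs next), one common way to examine the simple second-order interaction effects is to run the dose x gender ANOVA separately within each level of employee type:

# two-way dose x gender ANOVA within each employee type (simple second-order effects)
by(vitamin, vitamin$type, function(d) summary(aov(effort ~ dose * gender, data = d)))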

R bloggers
2017-05-15
Methodology Blogs

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

If you follow me on Twitter or monitor @Rapid7’s Community Blog you know I’ve been involved a bit in the WannaCry ransomworm triage.

One thing I’ve been doing is making charts of the hourly contributions to the Bitcoin addresses that the current/main attackers are using to accept ransom payments (which you really shouldn’t pay now, even if you are impacted: it’s unlikely they’re actually giving up keys anymore, since the likelihood of them getting cash out of the wallets without getting caught is pretty slim).

There’s a full-on CRAN-ified Rbitcoin package but I didn’t need the functionality in it (yet) to do the monitoring. I posted a hastily-crafted gist on Friday so folks could play along at home, but the code here is a bit more nuanced (and does more).

In the spirit of these R⁶ posts, the following is presented without further commentary apart from the interwoven comments.

library(jsonlite)
library(hrbrthemes)
library(tidyverse)

# the wallets accepting ransom payments
wallets <- c(
  "115p7UMMngoj1pMvkpHijcRdfJNXj6LrLn",
  "12t9YDPgwueZ9NyMgw519p7AA8isjr6SMw",
  "13AM4VW2dhxYgXeQepoHkHSQuy6NgaEb94"
)

# easy way to get each wallet info vs bringing in the Rbitcoin package
sprintf("https://blockchain.info/rawaddr/%s", wallets) %>%
  map(jsonlite::fromJSON) -> chains

# get the current USD conversion (tho the above has this, too)
curr_price <- jsonlite::fromJSON("https://blockchain.info/ticker")

# calculate some basic stats
tot_bc  <- sum(map_dbl(chains, "total_received")) / 10e7
tot_usd <- tot_bc * curr_price$USD$last
tot_xts <- sum(map_dbl(chains, "n_tx"))

# This needs to be modified once the counters go above 100 and also needs to
# account for rate limits in the blockchain.info API
paged <- which(map_dbl(chains, "n_tx") > 50)
if (length(paged) > 0) {
  sprintf("https://blockchain.info/rawaddr/%s?offset=50", wallets[paged]) %>%
    map(jsonlite::fromJSON) -> chains2
}

# We want hourly data across all transactions
map_df(chains, "txs") %>%
  bind_rows(map_df(chains2, "txs")) %>%
  mutate(xts = anytime::anytime(time),
         xts = as.POSIXct(format(xts, "%Y-%m-%d %H:00:00"), origin="GMT")) %>%
  count(xts) -> xdf

# Plot it
ggplot(...
R bloggers
2017-05-15
Methodology Blogs

Hello R community. If you’re up for some fun tinkering with a Shiny app, please join me on a new project. I would love to see some collaboration in designing a Shiny application that helps people make a decision about a healthcare provider. I have only just begun this project but would love to work with others.

This is just a quick look at the data; the roughest Shiny app you’ve ever seen can be found on my shinyapps.io page.

The first goal is to help people find a provider based on city and state (or perhaps zip code and latitude/longitude). This could take the form of a list, a map, etc. I would also like people to be able to glean some information about the place they are going in comparison to the surrounding locations.
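To make that idea concrete, here is a minimal sketch of a city/state lookup (my own code, not the project’s app). It assumes the merged data frame data built in the snippet further down, with the Hospital.Name, Address, City, State, and Phone.Number columns shown in the preview:

library(shiny)
library(dplyr)

ui <- fluidPage(
  titlePanel("Find a healthcare provider"),
  sidebarLayout(
    sidebarPanel(
      selectInput("state", "State", choices = sort(unique(data$State))),
      uiOutput("city_ui")
    ),
    mainPanel(tableOutput("providers"))
  )
)

server <- function(input, output, session) {
  # populate the city dropdown based on the chosen state
  output$city_ui <- renderUI({
    cities <- sort(unique(data$City[data$State == input$state]))
    selectInput("city", "City", choices = cities)
  })
  # list the distinct providers in the chosen city
  output$providers <- renderTable({
    data %>%
      filter(State == input$state, City == input$city) %>%
      distinct(Hospital.Name, Address, Phone.Number)
  })
}

shinyApp(ui, server)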

I was only able to put an hour or so into this (and that was months ago), but I’ve decided it would be fun to start collaborating with anyone who is interested. Please make pull requests and I’ll get to them!

The data can be found here (supplied by data.gov)

GitHub Repository

Here is a look at the data we’re dealing with after merging it with data from the zipcode package!

# I merged it with the zipcode data
library(zipcode)   # provides clean.zipcodes() and the zipcode data set
data = read.csv('../Infections/data/Healthcare_Associated_Infections_-_Hospital.csv')
data$zip = clean.zipcodes(data$ZIP.Code)
data(zipcode)
data = merge(data, zipcode, by.x = "zip", by.y = "zip")

head(data,3)

## zip Provider.ID Hospital.Name
## 1 00603 400079 HOSP COMUNITARIO BUEN SAMARITANO
## 2 00603 400079 HOSP COMUNITARIO BUEN SAMARITANO
## 3 00603 400079 HOSP COMUNITARIO BUEN SAMARITANO
## Address City State ZIP.Code
## 1 CARR.2 KM.1.4 AVE. SEVERIANO CUEVAS #18 AGUADILLA PR 603
## 2 CARR.2 KM.1.4 AVE. SEVERIANO CUEVAS #18 AGUADILLA PR 603
## 3 CARR.2 KM.1.4 AVE. SEVERIANO CUEVAS #18 AGUADILLA PR 603
## County.Name Phone.Number Measure.Name
## 1 AGUADILLA 7876580000 CAUTI: Observed Cases
## 2 AGUADILLA 7876580000 CAUTI: Predicted Cases
## 3 AGUADILLA 7876580000 CAUTI: Number of Urinary Catheter Days
## Measure.ID Compared.to.National Score
## 1 HAI_2_NUMERATOR Not Available Not Available
## 2 HAI_2_ELIGCASES Not Available Not Available
## 3 HAI_2_DOPC_DAYS Not Available Not Available
## Footnote
## 1 5 - Results are not available for this reporting period.
## 2 5 - Results are not available for this reporting period.
## 3 5 - Results are not available for this reporting period.
## Measure.Start.Date...
R bloggers
2017-05-14
Methodology Blogs

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Intro

The Great Bambino. The Big Unit. Joltin’ Joe. Henry Rowengartner. If you’re familiar with the sport of baseball, you might recognize some of these names from real life or the movies. Since baseball has been ingrained in the fabric of America for almost 200 years, and since it is my favorite sport, I thought it might be fun to take a look back at some of the best players to ever play the game and see how modern-day players stack up against them.

The Motivation

Sports analytics have progressed drastically in recent years, and with the wealth of data available for Major League Baseball, many teams are employing analytics departments to extract value from statistics. I decided to scrape Hall of Fame player data from baseball-reference.com to investigate these statistics and determine how good a player has to be in order to be inducted into the Hall of Fame. Additionally, I took a sample of data from players who have played since 1989 in order to predict whether or not they might be eligible to make the Hall of Fame.

Extracting the Data

The parent URL I used to extract Hall of Fame player statistics was on Baseball Reference, a baseball database that has all the baseball statistics one could ever want. I had to write two separate spiders to account for the different statistics used to measure a batter’s output and a pitcher’s output. All in all, there are 163 batters in the baseball Hall of Fame, which translates to a file of roughly 3,500 rows (including all their seasons played). There are 77 pitchers in the Hall of Fame, which translates to a file of about 1,600 rows (including all their seasons played). The data for more recent players was downloaded and filtered to include only batters who had over 500 plate appearances per year and only pitchers who pitched over 150 innings per year, in order to normalize the numbers.
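For illustration, that filtering step might look like this in R (a sketch under my own assumptions; the data frame names and the PA/IP columns for plate appearances and innings pitched are not from the post):

library(dplyr)

# keep only batter seasons with more than 500 plate appearances
recent_batters  <- filter(recent_batters,  PA > 500)

# keep only pitcher seasons with more than 150 innings pitched
recent_pitchers <- filter(recent_pitchers, IP > 150)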

Insights

First, I took a look at batters. I wanted to get a sense of the distribution of the number of home runs hit per season by Hall of Fame batters, as well as the number of hits per season. The histograms look like this:

[Histograms: home runs per season and hits per season for Hall of Fame batters]

Hitting a high number of home runs doesn’t appear to be a strong indication of making it to...
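For reference, histograms like the ones described above could be produced with ggplot2 roughly as follows (my own sketch, not the author’s code; the data frame hof_batting and its HR and H columns are assumptions):

library(ggplot2)

# home runs per Hall of Fame batter season
ggplot(hof_batting, aes(x = HR)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Home runs per season", y = "Number of player-seasons")

# hits per Hall of Fame batter season
ggplot(hof_batting, aes(x = H)) +
  geom_histogram(binwidth = 10) +
  labs(x = "Hits per season", y = "Number of player-seasons")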