R bloggers

R news and tutorials contributed by 573 R bloggers

New quantmod and TTR on CRAN

Fri, 2015-07-24 17:04

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I just sent quantmod_0.4-5 to CRAN, and TTR_0.23-0 has been there for a couple weeks. I’d like to thank Ivan Popivanov for many useful reports and patches to TTR. He provided patches to add HMA (Hull MA), ALMA, and ultimateOscillator functions.

James Toll provided a patch to the volatility function that uses a zero mean (instead of the sample mean) in close-to-close volatility. The other big change is that moving average functions no longer return objects with column names based on the input object column names. There are many other bug fixes (see the CHANGES file in the package).
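As a quick taste of the new indicators, here is a minimal sketch using the ttrc sample data that ships with TTR; note that mean0 as the argument name for the zero-mean estimator is my reading of the patch, so check ?volatility:

library(TTR)
data(ttrc) # sample OHLCV data bundled with TTR

hma  <- HMA(ttrc$Close, n = 20)                               # Hull moving average
alma <- ALMA(ttrc$Close)                                      # Arnaud Legoux moving average
uo   <- ultimateOscillator(ttrc[, c("High", "Low", "Close")]) # Ultimate Oscillator

# close-to-close volatility; mean0 = TRUE is assumed here to be the
# switch for James Toll's zero-mean estimator
vol <- volatility(ttrc[, c("Open", "High", "Low", "Close")],
                  calc = "close", mean0 = TRUE)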

The biggest changes in quantmod were to fix getSymbols.MySQL to use the correct dbConnect call based on changes made in RMySQL_0.10 and to fix getSymbols.FRED to use https:// instead of http:// when downloading FRED data. getSymbols.csv also got some much-needed love.
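From the user's side nothing changes; the https:// switch happens under the hood. A quick illustration (not from the original post), fetching a FRED series:

library(quantmod)
# downloads the 10-year Treasury constant maturity rate from FRED,
# now over https://
getSymbols("DGS10", src = "FRED")
head(DGS10)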

I’d also like to mention that development has moved to GitHub for both TTR and quantmod.


A Path Towards Easier Map Projection Machinations with ggplot2

Fri, 2015-07-24 15:50

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

The $DAYJOB doesn’t afford much opportunity to work with cartographic
datasets, but I really like maps and tinker with shapefiles and
geo-data when I can, plus answer a ton of geo-questions on
StackOverflow. R makes it easy—one might even say too easy—to work
with maps. All it takes to make a map of the continental United States
(CONUS) is:

library(maps)
map("state")

It’s a little more involved with ggplot2:

library(ggplot2)
library(ggthemes)

states <- map_data("state")

gg <- ggplot()
gg <- gg + geom_map(data=states, map=states,
                    aes(x=long, y=lat, map_id=region),
                    color="black", fill="white", size=0.25)
gg <- gg + theme_map()
gg

Both of those maps are horrible. The maps::map function defaults to
using a rectangular projection with the aspect ratio chosen so that
longitude and latitude scales are equivalent at the center of the
picture. ggplot will size the graphic to the device window. Sticking
with these defaults is a really bad idea. Why? I’ll let Mark Monmonier
(author of How to Lie with Maps) explain:

Maps have three basic attributes: scale, projection, and
symbolization. Each element is a source of distortion. As a group,
they describe the essence of the map’s possibilities and limitations.
No one can use maps or make maps safely and effectively without
understanding map scales, map projections, and map symbols.

Map projections distort five geographic relationships: areas, angles,
gross shapes, distances, and directions. Although some projections
preserve local angles but not areas, others preserve areas but not
local angles. All distort large shapes noticeably (but some distort
continental shapes more than others), and all distort at least some
distances and some directions.

There are great examples in his book on how map projections can inflate
or diminish the area and relative importance of countries and regions,
and how a map projection can itself become a rallying point for
“cartographically oppressed” regions.

If map projections are important (and they are) what should you do? Both
map and ggplot give you options to use projections that are found in
the mapproj library (specifically using the mapproject function).
The help for mapproject even gives an example of using the Albers equal-area conic projection when plotting the CONUS:

library(mapproj)
map("state", projection="albers", par=c(lat0=30, lat1=40))

As its full name suggests, Albers is an equal-area projection, which is recommended for U.S. choropleths as it preserves the relative areas of geographic features. It’s also better than the ggplot default (“mercator”) in its coord_map:

gg + coord_map()

But, we can pass in parameters to use a saner projection:

gg + coord_map("albers", lat0=30, lat1=40)

The mapproject function exposes 41 projections, which may seem generous, but not all of them are practical, and even the ones that take parameters are not very customizable. Before we dig into that a bit more, you’re probably wondering (if you don’t work with cartography for a living or hobby) how one goes about choosing a map projection…

Choosing A Map Projection

Thankfully, many smart folks have thought quite a bit about choosing map
projections and there are a number of resources you can draw upon when
making your own choices.

The first one is Map Projections – A Working Manual. This is a free resource from the U.S. Geological Survey and was published back in 1987. It has a “companion” resource – An Album of Map Projections. Both are outstanding resources to have on hand as they provide a great deal of information on map projections. If you’re in a hurry, the “Album” makes for a good quick reference. Here’s the entry for our buddy Albers:




(Go to page 10 of the “Album” for the larger version)

The image in the upper-right is a “Tissot indicatrix” (named for its creator Nicolas Auguste Tissot), which “puts Tissot indicatrices at
every 30° of latitude and longitude, except at the poles. This shows the
shape of infinitesimally small circles on the Earth as they appear when
they are plotted by using a fixed finite scale at the same locations on
a map. Every circle is plotted as a circle or an ellipse or, in extreme
cases, as a straight line. On an equal-area projection, all these
ellipses and circles are shown as having the same area. The flattening
of the ellipse shows the extent of local shape distortion and how much
the scale is changed and in what direction. On conformal map
projections, all indicatrices remain circles, but the areas change. On
other projections, both the areas and the shapes of the indicatrices
change”. This makes for a great way to understand just how your creations are being distorted (though that may be obvious when you actually plot your maps).

The “Working Manual” also includes the equations necessary to compute
the projections. Both provide projection usage guidance as well as the
technical bits describing them.

The Institute for Environment and Sustainability has a similar guide for Map Projections for Europe.

Many countries and municipalities also have their own guides for using
projections (e.g.
California).

If you can handle what feels like a TARDIS trip back in time to the days of GeoCities, MapRef also provides good information on projections. You can also review Carlos A. Furuti’s compilation of projections for more information.

Using More Complex/Nuanced Projections

Despite being able to use Albers and 40 other projections via
mapproject, having more flexibility would allow us to use grid systems
(see the refs in the previous section for what those are) and also
hyper-customize many projections without the need to write our own
equations (be thankful of that as you skim the math in the “Working
Manual”!).

Gerald Evenden developed a library and utility called proj for the USGS back in 1995, and said utility, library & projection specification idiom has been maintained and expanded ever since in the PROJ.4 project. This library/tool is straightforward to install on most (if not all) operating systems.
PROJ.4 lets you specify projections in a (fairly complex) string format
(often referred to as a proj4string, especially in R). For example,
you can specify Albers for the U.S. with:

"+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96"

Cartographic Projection Procedures for the UNIX Environment—A User’s Manual (considered “the” manual for PROJ.4) explains how to build the specification. At the time of this blog post there are 134 named (i.e. +proj=NAME) projections available for use & customization (run proj -l at a command-line to see the full list).
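If you have the proj4 R package installed, you can sanity-check a string like that directly (my sketch, not from the original post); project() returns projected coordinates in map units (meters here):

library(proj4)

# project a lon/lat point (roughly the center of Kansas) with the
# Albers string from above
project(matrix(c(-98, 39), ncol = 2),
        "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96")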

A simple example of why this is useful is when you need to plot a world map. Said activity should always be prefaced with a review of this seminal work so you will choose a good projection—say, Winkel-Tripel. A quick glance at what mapproject supports will show you that you’re pretty much out of luck using the standard R tools we’ve seen so far, but there is a handy +proj=wintri in PROJ.4 we can use.

Here’s how you’d plot that projection in the base plotting system:

library(sp)
library(rworldmap) # this pkg has waaaay better world shapefiles

plot(spTransform(getMap(), CRS("+proj=wintri")))

However, we can’t pass in PROJ.4 strings to ggplot’s coord_map, so we
have to project the Earth first before plotting and use
coord_equal:

world <- fortify(spTransform(getMap(), CRS("+proj=wintri")))

gg <- ggplot()
gg <- gg + geom_map(data=world, map=world,
                    aes(x=long, y=lat, map_id=id),
                    color="black", fill="white", size=0.25)
gg <- gg + coord_equal()
gg <- gg + theme_map()
gg

While that works, you have to pre-project any lines, points, etc. as
well before plotting them with the map. For example, here are the
locations of all of the non-military rocket launch sites (yeah, I’ve done that before, but it’s a fun data set! – and I had it handy):

library(rgdal) # for readOGR
library(httr)  # for downloading

launch_sites <- paste0("https://collaborate.org/sites/collaborate.org/",
                       "files/spacecentersnon-militaryspacelaunchsites.kml")
invisible(try(GET(launch_sites, write_disk("/tmp/launch-sites.kml")),
              silent=TRUE))

sites <- readOGR("/tmp/launch-sites.kml",
                 "SpaceCenters: Non-Military Space Launch Sites",
                 verbose=FALSE)
sites_wt <- spTransform(sites, CRS("+proj=wintri"))
sites_wt <- coordinates(sites_wt)

gg <- gg + geom_point(data=as.data.frame(sites_wt),
                      aes(x=coords.x1, y=coords.x2), color="#b2182b")
gg

If you have more layers, that’s doable, but painful. If we could make ggplot allow for PROJ.4 projections in some coord_, it would help keep our plot idioms (and code) cleaner. Thus, coord_proj was born.

coord_proj mimics the functionality of coord_map but uses
proj4::project instead of mapproject (and takes in any of those
parameters). The world map with launch sites now looks like this
(complete ggplot code):

world <- fortify(getMap())
sites <- as.data.frame(coordinates(readOGR("/tmp/launch-sites.kml",
                                           "SpaceCenters: Non-Military Space Launch Sites",
                                           verbose=FALSE)))

gg <- ggplot()
gg <- gg + geom_map(data=world, map=world,
                    aes(x=long, y=lat, map_id=id),
                    color="black", fill="white", size=0.25)
gg <- gg + geom_point(data=sites, aes(x=coords.x1, y=coords.x2),
                      color="#b2182b")
gg <- gg + coord_proj("+proj=wintri")
gg <- gg + theme_map()
gg

Now, if you want to go all Hobo-Dyer instead of Winkel-Tripel, it’s one small change:

gg + coord_proj("+proj=cea +lat_ts=37.5")

Spatial Reference is a website by Howard Butler, Christopher Schmidt, Dane Springmeyer, and Josh Livni that helps others in their understanding, recording, and usage of spatial reference systems. Its ultimate goal is to let you use URIs to specify spatial reference systems, but you can use it to look up Proj4 strings to meet virtually any need you have.

The Bad News

Presently, coord_proj is not part of the ggplot2 code base. I chose
a really bad time to muck with this as Hadley is doing all sorts of
spiffy stuff to ggplot2 and it’s not a good time to shove this in there.

You can fork your own copy of ggplot2 from Hadley’s GitHub repo and add this file to it, re-document/collate/build the package, and then use it locally. It still needs some work (there’s a horrible hack in it that anyone familiar with geo-stuff will cringe at) but I welcome feedback/contributions. Hopefully this will get into ggplot2 (or out of it and into a separate package, since Hadley is making ggplot2 very extensible in his update) at some point in the not-too-distant future.


{Long Vs. Wide} Data Frames

Fri, 2015-07-24 11:19

(This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers)

Introduction

This article is an excellent resource for understanding the two data frame formats: long and wide.

  • Just take a look at figure 1 inside the article

1) Long format: in certain scenarios (generally grouped plots), ggplot2 needs this kind of format to work.

2) Wide format: on the other hand, when you read transactional data you may find it in “long” format, and you need it in “wide” format in order to create a predictive model.

Here, each row represents a case, and each column an attribute/variable: the classical input for building a clustering or predictive model.

R Library

The most-used library to achieve this is “reshape2”. And what’s the difference with “reshape”?

The package author said:

“Reshape2 is a reboot of the reshape package. It’s been over five years
since the first release of the package”…”reshape2 uses that knowledge to make a new package for reshaping data that is much more focused and much much faster.”
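A minimal illustration (mine, not from the linked article) of going from wide to long and back with reshape2:

library(reshape2)

# wide: one row per case, one column per variable
wide <- data.frame(id     = 1:3,
                   height = c(170, 165, 180),
                   weight = c(65, 60, 80))

# wide -> long: one row per (id, variable) pair; this is what
# ggplot2 wants for grouped plots
long <- melt(wide, id.vars = "id")

# long -> wide again: the classical model-matrix-style layout
dcast(long, id ~ variable)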

Happy transforming!


R #6 in IEEE 2015 Top Programming Languages, Rising 3 Places

Fri, 2015-07-24 11:05

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

IEEE Spectrum has published its 2015 list of Top Programming Languages, and R ranks in 6th place, jumping 3 places from its 2014 ranking.

Here's what the IEEE has to say about the top 10 from the table above:

The big five—Java, C, C++, Python, and C#—remain on top, with their ranking undisturbed, but C has edged to within a whisper of knocking Java off the top spot. The big mover is R, a statistical computing language that’s handy for analyzing and visualizing big data, which comes in at sixth place. Last year it was in ninth place, and its move reflects the growing importance of big data to a number of fields. 

IEEE Spectrum ranks languages using a weighted score of 12 factors, including Google search rankings and trends, social media chatter, aggregator posts (Reddit and Hacker News), social programming activity (GitHub and StackOverflow), job opportunities (CareerBuilder and Dice) and academic citations. You can also specify your own rankings using this interactive web application (for a charge of $0.99). The application also offers rankings of trending languages (R is #10 in this list), languages in demand by employers (R is #13), and languages popular on open source hubs (R is #10). However you measure, R's ranking is impressive: as a domain-specific language for data science, the fact that it ranks alongside general-purpose languages like Java, C and Python demonstrates the importance of advanced data analysis in today's world.

IEEE Spectrum: The 2015 Top Ten Programming Languages


Why I use Panel/Multilevel Methods

Fri, 2015-07-24 11:04

(This article was first published on DiffusePrioR » R, and kindly contributed to R-bloggers)

I don’t understand why any researcher would choose not to use panel/multilevel methods on panel/hierarchical data. Let’s take the following linear regression as an example:

$y_{it} = \beta_0 + \beta_1 x_{it} + a_i + u_{it},$

where $a_i$ is a random effect for the $i$-th group. A pooled OLS regression model for the above is unbiased and consistent. However, it will be inefficient unless $a_i = 0$ for all $i$.

Let’s have a look at the consequences of this inefficiency using a simulation. I will simulate the following model:

$y_{it} = \beta_0 + \beta_1 x_{it} + a_i + u_{it},$

with $\beta_0 = 1$, $\beta_1 = 5$, $a_i \sim N(0, 3^2)$ and $u_{it} \sim N(0, 1)$. I will do this simulation and compare the following 4 estimators: pooled OLS; random effects (RE), AKA a multilevel model with a mixed effect intercept; a correlated random effects (CRE) model (include the group mean as a regressor, as in Mundlak (1978)); and finally the regular fixed effects (FE) model. I am doing this in R, so for the first model I will use the simple lm() function, for the second and third lmer() from the lme4 package, and finally the excellent felm() function from the lfe package. These models will be tested under two conditions. First, we will assume that the random effects assumption holds, i.e. the regressor $x_{it}$ is uncorrelated with the random effect $a_i$. After looking at this, we will then allow the random effect $a_i$ to correlate with the regressor $x_{it}$.

The graph below shows the importance of using panel methods over pooled OLS. It shows boxplots of the 100 simulated estimates. Even when the random effects assumption is violated, the random effects estimator (RE) is far superior to simple pooled OLS. Both the CRE and FE estimators perform well. Both have the lowest root mean square errors, even when the model satisfies the random effects assumption. Please see my R code below.

# clear ws
rm(list=ls())

# load packages
library(lme4)
library(plyr)
library(lfe)
library(reshape)
library(ggplot2)

# from this:
### set number of individuals
n = 200
# time periods
t = 5

### model is: y=beta0_{i} +beta1*x_{it} + e_{it}
### average intercept and slope
beta0 = 1.0
beta1 = 5.0

### set loop reps
loop = 100

### results to be entered
results1 = matrix(NA, nrow=loop, ncol=4)
results2 = matrix(NA, nrow=loop, ncol=4)

for(i in 1:loop){
  # basic data structure
  data = data.frame(t = rep(1:t,n), n = sort(rep(1:n,t)))
  # random effect/intercept to add to each
  rand = data.frame(n = 1:n, a = rnorm(n,0,3))
  data = join(data, rand, match="first")
  # random error
  data$u = rnorm(nrow(data), 0, 1)
  # regressor x
  data$x = runif(nrow(data), 0, 1)
  # outcome y
  data$y = beta0 + beta1*data$x + data$a + data$u
  # make factor for i-units
  data$n = as.character(data$n)
  # group i mean's for correlated random effects
  data$xn = ave(data$x, data$n, FUN=mean)
  # pooled OLS
  a1 = lm(y ~ x, data)
  # random effects
  a2 = lmer(y ~ x + (1|n), data)
  # correlated random effects
  a3 = lmer(y ~ x + xn + (1|n), data)
  # fixed effects
  a4 = felm(y ~ x | n, data)
  # gather results
  results1[i,] = c(coef(a1)[2], coef(a2)$n[1,2],
                   coef(a3)$n[1,2], coef(a4)[1])
  ### now let random effects assumption be false
  ### ie E[xa]!=0
  data$x = runif(nrow(data), 0, 1) + 0.2*data$a
  # the below is like above
  data$y = beta0 + beta1*data$x + data$a + data$u
  data$n = as.character(data$n)
  data$xn = ave(data$x, data$n, FUN=mean)
  a1 = lm(y ~ x, data)
  a2 = lmer(y ~ x + (1|n), data)
  a3 = lmer(y ~ x + xn + (1|n), data)
  a4 = felm(y ~ x | n, data)
  results2[i,] = c(coef(a1)[2], coef(a2)$n[1,2],
                   coef(a3)$n[1,2], coef(a4)[1])
}

# calculate rmse
apply(results1, 2, function(x) sqrt(mean((x-5)^2)))
apply(results2, 2, function(x) sqrt(mean((x-5)^2)))

# shape data and do ggplot
model.names = data.frame(X2 = c("1","2","3","4"),
                         estimator = c("OLS","RE","CRE","FE"))
res1 = melt(results1)
res1 = join(res1, model.names, match="first")
res2 = melt(results2)
res2 = join(res2, model.names, match="first")
res1$split = "RE Valid"
res2$split = "RE Invalid"
res1 = rbind(res1, res2)
res1$split = factor(res1$split, levels = c("RE Valid", "RE Invalid"))
res1$estimator = factor(res1$estimator, levels = c("OLS","RE","CRE","FE"))

number_ticks = function(n) {function(limits) pretty(limits, n)}

ggplot(res1, aes(estimator, value)) +
  geom_boxplot(fill="lightblue") +
  #coord_flip() +
  facet_wrap( ~ split, nrow = 2, scales = "free_y") +
  geom_hline(yintercept = 5) +
  scale_x_discrete('') +
  scale_y_continuous(bquote(beta==5), breaks=number_ticks(3)) +
  theme_bw() +
  theme(axis.text=element_text(size=16),
        axis.title=element_text(size=16),
        legend.title = element_blank(),
        legend.text = element_text(size=16),
        strip.text.x = element_text(size = 16),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("remc.pdf", width=8, height=6)


mapView: basic interactive viewing of spatial data in R

Fri, 2015-07-24 05:10

(This article was first published on metvurst, and kindly contributed to R-bloggers)

Working with spatial data in R I find myself quite often in the need to quickly visually check whether a certain analysis has produced reasonable results. There are two ways I usually do this. Either I:

  1. (sp)plot the data in R and then toggle back and forth between the static plots (I use RStudio) or
  2. save the data to the disk and then open in QGIS or similar to interactively examine the results.

Both these approaches are semi-optimal. While option 1 is fine for a quick glance at coarse patterns, it lacks the possibility of a closer look at the results via zooming and panning. While option 2 provides the interactivity, the detour via the hard disk is annoying (at best), especially when fine-tuning and checking regularly.

I attended this year's useR! 2015 conference in Aalborg (which was marvelous!), including the session on interactive graphics in R, where Joe Cheng from RStudio presented the leaflet package. Leaflet is great, but it's rather geared towards manually setting up maps. What GIS-like functionality would need is some default behaviour for different objects from the spatial universe.

This got me thinking and sparked my enthusiasm to write some wrapper functions for leaflet to provide at least very basic GIS-like interactive graphing capabilities that are directly accessible within RStudio (or the web browser, if you're not using RStudio). So I sat down and wrote a function called mapView().
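For a flavour of what that looks like, here is a minimal sketch using a classic sp data set; the package name mapview is my assumption (at the time of this post the code lived on GitHub rather than CRAN), while the mapView() function is the one named above:

library(sp)
library(mapview) # package name assumed; see the full article for installation

# promote the classic meuse data set to a SpatialPointsDataFrame
data(meuse)
coordinates(meuse) <- ~x+y
proj4string(meuse) <- CRS("+init=epsg:28992")

# renders an interactive leaflet map in the RStudio viewer (or browser)
mapView(meuse)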

Unfortunately, it is not possible to present interactive leaflet content here on WordPress.

Therefore, the full article is published at my web presentations space on GitHub.

Here's a little sneak preview though.

Enjoy reading!

Tim


CACM Highlights R

Thu, 2015-07-23 23:16

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

The Association for Computing Machinery is the main professional organization for computer science, largely for academia but still with a broad membership. ACM publishes a number of journals, most of them for research but its flagship publication is a magazine, the Communications of the ACM.

The current issue of the CACM includes an article, “Bringing Big Data to the Big Tent,” that is mainly about R and Spark. After discussing the wide usage that R has developed, it raises a question as to whether R, specifically CRAN, is too disorganized. CMU CS professor Jim Herbsleb is quoted as saying

There’s a lot of duplication of effort out there, a lot of missed opportunities, where one scientist has developed a tool for him or herself, and with a few tweaks, or if it conformed to a particular standard or used a particular data format, it could be useful to a much wider community.

I understand his point, but I strongly disagree. I really like the free-form way that CRAN (and Bioconductor etc.) works, and appreciate the fact that when I need some utility, not only is CRAN likely to have it, but it’s likely to have several versions, by different authors, giving me a lot of choice. Besides, the various libraries available in the CS world include a lot of duplication too, yet no one seems to mind.

I do believe that there should be more structure to CRAN. The Task Views are nice, but are often nowhere near comprehensive, and some tend to be out of date. I’ve also proposed that there be a Yelp-style review system for CRAN packages.

Speaking of CRAN and Spark, the new version of my partools package (which I informally call Snowdoop) went on CRAN a few days ago. I continue to believe that Hadoop and Spark are not appropriate for the majority of R users who work with large data, and I offer partools as one alternative. The package is much more extensive than the last version. I’ll be blogging about it in the near future (and from time to time afterward, with more news and examples) but in the meantime, I recommend reading the vignette for an introduction to usage of the package. See also my recent talk. Note: Although the DESCRIPTION file says that partools requires a Unix-family OS, it should work fine with Windows.


A 15-Week Intro Statistics Course Featuring R

Thu, 2015-07-23 20:56

(This article was first published on Quality and Innovation » R, and kindly contributed to R-bloggers)

Morgan at Burning Man 2014. (Image Credit: Nicole Radziwill)

Do you teach introductory statistics or data science? Need some help planning your fall class?

I apply the 10 Principles of Burning Man in the design and conduct of all my undergraduate and graduate-level courses, including my introductory statistics class (which has a heavy focus on R and data science) at JMU. This means that I consider learning to be emergent, and as a result, it often doesn’t follow a prescribed path of achieving specified learning objectives. However, in certain courses, I still feel like it’s important to provide a general structure to help guide the way! This also helps the students get a sense of our general trajectory over the course of the semester, and do readings in advance if they’re ready.

Since several people have asked for a copy, here is the SYLLABUS that I use for my 15-week class (that also uses the “informal” TEXTBOOK I wrote this past spring). We meet twice a week for an hour and 15 minutes each session. The class is designed for undergraduate sophomores, but there are always students from all levels enrolled. The course is intended to provide an introduction to (frequentist) statistical thinking, but with an applied focus that has practical data analysis at its core.

My goal is simple. At the end of the semester, I want students to be able to:

Please let me know if this syllabus is helpful to you! I’ll be posting my intensive (5-session) version of this tomorrow or the next day.


An alternative presentation of the ProPublica Surgeon Scorecard

Thu, 2015-07-23 19:06

(This article was first published on DataSurg » R, and kindly contributed to R-bloggers)

ProPublica, an independent investigative journalism organisation, have published surgeon-level complications rates based on Medicare data. I have already highlighted problems with the reporting of the data: surgeons are described as having a “high adjusted rate of complications” if they fall in the red-zone, despite there being too little data to say whether this has happened by chance.

This surgeon should not be identified as having a “high adjusted rate of complications” as there are too few cases to estimate the complication rate accurately.

I say again, I fully support transparency and public access to healthcare. But the ProPublica reporting has been quite shocking. I’m not aware of them publishing the number of surgeons out of the 17000 that are statistically different to the average. This is a small handful.

ProPublica could have chosen a different approach. This is a funnel plot and I’ve written about them before.

A funnel plot is a summary of an estimate (such as complication rate) against a measure of the precision of that estimate. In the context of healthcare, a centre or individual outcome is often plotted against patient volume. A horizontal line parallel to the x-axis represents the outcome for the entire population and outcomes for individual surgeons are displayed as points around this. This allows a comparison of individuals with that of the population average, while accounting for the increasing certainty surrounding that outcome as the sample size increases. Limits can be determined, beyond which the chances of getting an individual outcome are low if that individual were really part of the whole population.

In other words, a surgeon above the line has a complication rate different to the average.

I’ve scraped the ProPublica data for gallbladder removal (laparoscopic cholecystectomy) from California, New York and Texas for surgeons highlighted in the red-zone. These are surgeons ProPublica says have high complication rates.

As can be seen from the funnel plot, these surgeons are nowhere near being outliers. There is insufficient information to say whether any of them is different to average. ProPublica decided to ignore the imprecision with which the complication rates are determined. For red-zone surgeons from these 3 states, none has a complication rate different to average.

Black line, population average (4.4%), blue line 95% control limit, red line 99.7% control limit.

How likely is it that a surgeon with an average complication rate (4.4%) will appear in the red-zone just by chance (>5.2%)? The answer is, pretty likely given the small numbers of cases here: anything between 15 and 25% chance. Even at the top of the green-zone (low ACR, 3.9%), there is still around a 1 in 6 chance a surgeon will appear to have a high complication rate just by chance.
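To make that concrete with the same binomial calculation used in the code below, take a surgeon with 50 cases:

# chance that a surgeon with 50 cases and a true 4.4% complication
# rate shows an observed rate above the 5.2% red-zone cut-off
1 - pbinom(ceiling(50 * 0.052), 50, 0.044)
# ~ 0.18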

ProPublica have failed in their duty to explain these data in a way that can be understood. The surgeon score card should be revised. All “warning explanation points” should be removed for those other than the truly outlying cases.

Data

Download

Code

# ProPublica Surgeon Scorecard
# https://projects.propublica.org/surgeons
# Laparoscopic cholecystectomy (gallbladder removal) data
# Surgeons with "high adjusted rate of complications"
# CA, NY, TX only

# Libraries needed ----
library(ggplot2)
library(binom)
library(reshape2) # for melt(), used below

# Upload dataframe ----
dat = read.csv("http://www.datasurg.net/wp-content/uploads/2015/07/ProPublica_CA_NY_TX.csv")

# Total number reported
dim(dat)[1] # 59

# Remove duplicate surgeons who operate in more than one hospital
duplicates = which(
    duplicated(dat$Surgeon)
)
dat_unique = dat[-duplicates,]
dim(dat_unique) # 27

# Funnel plot for gallbladder removal adjusted complication rate ----

# Set up blank funnel plot ----
# Set control limits
pop.rate = 0.044 # Mean population ACR, 4.4%
binom_n = seq(5, 100, length.out=40)
ci.90 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.90, methods = "wilson")
ci.95 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.95, methods = "wilson")
ci.99 = binom.confint(pop.rate*binom_n, binom_n, conf.level = 0.99, methods = "wilson")

theme_set(theme_bw(24))
g1 = ggplot()+
    geom_line(data=ci.95, aes(ci.95$n, ci.95$lower*100), colour = "blue")+
    geom_line(data=ci.95, aes(ci.95$n, ci.95$upper*100), colour = "blue")+
    geom_line(data=ci.99, aes(ci.99$n, ci.99$lower*100), colour = "red")+
    geom_line(data=ci.99, aes(ci.99$n, ci.99$upper*100), colour = "red")+
    geom_line(aes(x=ci.90$n, y=pop.rate*100), colour="black", size=1)+
    xlab("Case volume")+
    ylab("Adjusted complication rate (%)")+
    scale_colour_brewer("", type = "qual", palette = 6)+
    theme(legend.justification=c(1,1), legend.position=c(1,1))
g1

g1 +
    geom_point(data=dat_unique, aes(x=Volume, y=ACR), colour="black",
               alpha=0.6, size = 6, show_guide=TRUE)+
    geom_point(data=dat_unique, aes(x=Volume, y=ACR, colour=State),
               alpha=0.6, size=4) +
    ggtitle("Funnel plot of adjusted complication rate in CA, NY, TX")

# Probability of being shown as having high complication rate ----
# At 4.4%, what are the chances of being 5.2% by chance?
n <- seq(15, 150, 1)
average = 1-pbinom(ceiling(n*0.052), n, 0.044)
low = 1-pbinom(ceiling(n*0.052), n, 0.039)
dat_prob = data.frame(n, average, low)

ggplot(melt(dat_prob, id="n"))+
    geom_point(aes(x=n, y=value*100, colour=variable), size=4)+
    scale_x_continuous("Case volume", breaks=seq(10, 150, 10))+
    ylab("Adjusted complication rate (%)")+
    scale_colour_brewer("True complication rate", type="qual", palette = 2,
                        labels=c("Average (4.4%)", "Low (3.9%)"))+
    ggtitle("ProPublica chance of being in high complication rate zone by\nchance when true complication rate \"average\" or \"low\"")+
    theme(legend.position=c(1,0), legend.justification=c(1,0))


Revolution R Open 3.2.1 now available

Thu, 2015-07-23 12:52

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The latest update to Revolution R Open, RRO 3.2.1, is now available for download from MRAN. This release upgrades to the latest R engine (3.2.1), enables package downloads via HTTPS by default, and adds new supported Linux platforms.

Revolution R Open 3.2.1 includes: 

You can download Revolution R Open now from the link below, and we welcome comments, suggestions and other discussion on the RRO Google Group. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

MRAN: Download Revolution R Open


Sunbelt XXXV, Social Network Analysis, Statnet and R

Thu, 2015-07-23 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)

A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:

 

and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.

Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.

And, maybe of equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.

From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book Statistical Analysis of Network Data with R. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics, such as modeling processes that take place on graphs and network flows.)
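To get a small taste (my sketch, not from the original post), the ergm package ships with the Florentine marriage network, and fitting a first ERGM takes two lines:

library(statnet) # loads ergm, network, sna, etc.

# Florentine family marriage ties, bundled with the ergm package
data(florentine)

# model the probability of a tie as a function of edge count and
# triangle count (a classic introductory ERGM specification)
fit <- ergm(flomarriage ~ edges + triangle)
summary(fit)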

In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site that contains documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.

I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage or in the box below. Click the Get Started button. Then select "built-in network" and "ecoli 1" under File type. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.


You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package, you will be writing your own R code shortly after that.

 


Administrative Maps and Projections in R

Thu, 2015-07-23 10:45

(This article was first published on AriLamstein.com » R, and kindly contributed to R-bloggers)

Today I will demonstrate how to create maps of “other countries”, and use map projections, with the choroplethr package in R. I write “other countries” in quotes because, like most things, translating a high level wish into software can be complicated. If you want to skip ahead and play with a web app that lets you explore the Administrative Level 1 maps of every country in the world, and see how they look under various projections, then click here. Here is a screenshot:

The source code for the app is here. You can learn more about Administrative Divisions here and here.

Administrative Level 0 Maps

A map that shows countries of the world is frequently called an Administrative Level 0 map. Last year the only Admin 0 map I could find in R listed the USSR as a country, which meant that I couldn’t use it to create choropleth maps of modern data.

I wanted to map World Bank data, so I packaged version 2.0.0 of Natural Earth Data’s Admin 0 map and added it to my choroplethrMaps package.  I did this in the summer of 2014, because I wanted to present this functionality at the 2014 R User Conference. You can see my documentation on making country level choropleths here.  You can see my documentation on mapping World Bank Data here.

Administrative Level 1 Maps

A map that shows the borders of states (or state equivalents) is called an Administrative Level 1 map. In December of 2014 I gave a talk at the Japan.R conference, and wanted to include some choropleth maps of Japanese prefecture-level demographic statistics. To do this I packaged Natural Earth Data’s Admin 1 Map  into a new package: choroplethrAdmin1. You can see documentation of that package here.

The app that I link to above uses the choroplethrAdmin1 package to render the Administrative Level 1 map of a country that the user specifies.

Projections

At its core choroplethr uses the ggplot2 library to render maps. This means that, by default, maps are rendered with Cartesian coordinates. This results in maps that “look a bit funny”; virtually all published maps have some form of map projection applied to them, and do not use Cartesian coordinates.

To add a projection to a ggplot2 object use the coord_map() function. The app that I link to above lets you apply 16 different projections to the Admin 1 maps of each country in the world. Note that this dropdown does not include the 18 projections in the mapproj package that require additional parameters. This was done for convenience on my part.

Here is an example showing how projections affect a map of continental US counties. Note that at full zooms the Census Bureau uses an Albers projection with lat0=29.5 and lat1=45.5; at smaller zooms they use a Mercator projection.

library(choroplethr)
library(choroplethrMaps)
library(ggplot2)
library(mapproj)
library(gridExtra)

data(state.regions)
continental_us = state.regions$region[!state.regions$region %in% c("alaska", "hawaii")]

data(df_pop_county)

no_projection = county_choropleth(df_pop_county,
                                  title = "No Projection",
                                  state_zoom = continental_us) +
  theme(legend.position="none")

albers_projection = county_choropleth(df_pop_county,
                                      title = "Albers Projection",
                                      state_zoom = continental_us) +
  theme(legend.position="none")
albers_projection = albers_projection +
  coord_map(projection = "albers", lat0 = 29.5, lat1 = 45.5)

mercator_projection = county_choropleth(df_pop_county,
                                        title = "Mercator Projection",
                                        state_zoom = continental_us) +
  theme(legend.position="none")
mercator_projection = mercator_projection +
  coord_map(projection = "mercator")

grid.arrange(no_projection, albers_projection, mercator_projection, nrow=3)

Alaska and Hawaii

In the last example the omission of Alaska and Hawaii was deliberate. In fact, when I began creating county choropleths in R I could not find a map that included the counties of Alaska or Hawaii at all. To remedy this I downloaded a map from the Census Bureau myself and put it in the package choroplethrMaps. I needed that map in order to create choropleths of data from the American Community Survey (ACS).

Once I had that map I encountered another problem: what to do with Alaska and Hawaii? For reference, here is how that map looks by default, both with no projection and a Mercator projection

data(county.map)

ggplot(county.map, aes(long, lat, group = group)) +
  geom_polygon() +
  ggtitle("US County Map\nNo Insets, Cartesian Coordinates")

ggplot(county.map, aes(long, lat, group = group)) +
  geom_polygon() +
  coord_map() +
  ggtitle("US County Map\nNo Insets, Mercator Projection")


Most people are shocked at just how big Alaska is, and just how far south Hawaii is. More importantly, Alaska’s boroughs are so large that it becomes difficult to extract information from the counties in the contiguous 48 states.

If you google something like “ggplot2 map alaska hawaii” you will see a large number of people struggling with how to “move and resize Alaska and Hawaii so that they look normal”. I originally looked to simply transpose them, but my GIS friends recommended that I instead render Alaska and Hawaii separately, and then “attach” them to the map of the continental US. They told me that this is how map “insets” are normally accomplished in the GIS world. I posted for help with this on the ggplot2 mailing list, and wound up using ggplot2’s annotation_custom function. The result is familiar to most users of choroplethr:

county_choropleth(df_pop_county,
                  title="County Population\nInsets, Cartesian Coordinates",
                  legend="Population")

 

 

Insets and Projections

The above map, of course, “looks a little funny”. That is, it uses Cartesian coordinates. In fact, a technical limitation of annotation_custom is that it does not work with projections. When I first released choroplethr this was not an issue; users were simply happy with what the package could provide. But as the number of users increased, more people wanted “normal looking” maps of the 50 states, with insets.

Originally I was not sure how to do this, so I posted for help on the ggplot2 mailing list. Bob Rudis replied with a solution that I think I can apply to choroplethr. I’m currently preparing for a tutorial I am teaching next week, but I hope to tackle this project soon after.
div.fca_eoi_layout_field_wrapper div.fca_eoi_layout_field_inner input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_field_wrapper div.fca_eoi_layout_field_inner input:focus,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_field_wrapper div.fca_eoi_layout_field_inner input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_field_wrapper div.fca_eoi_layout_field_inner input:focus,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_field_wrapper div.fca_eoi_layout_field_inner input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_submit_button_wrapper input:focus,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_form_text_element.fca_eoi_layout_privacy_copy_wrapper div,#fca_eoi_form_400 
div.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_popup div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_widget div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 div.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 form.fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_content_wrapper div.fca_eoi_layout_fatcatapps_link_wrapper a{font-size:12px !important;}}@media (min-width:1px) and (max-width:450px),(min-height:1px) and (max-height:450px){#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_content_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_content_wrapper{margin:8px 13px;}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_fatcatapps_link_wrapper a{margin:0;}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_form_text_element.fca_eoi_layout_headline_copy_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_form_text_element.fca_eoi_layout_headline_copy_wrapper{margin-bottom:5px;}}@media (min-width:1px) and (max-width:320px),(min-height:1px) and (max-height:320px){#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_popup_close,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_popup_close{top:-1px;right:-1px;}}@media (min-width:1px) and (max-width:768px){#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_field_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_field_wrapper{float:none;width:100%;}}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_headline_copy_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_headline_copy_wrapper{margin-bottom:20px;}@media (min-width:1px) and (max-width:450px),(min-height:1px) and (max-height:450px){#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_headline_copy_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_headline_copy_wrapper{margin-bottom:0;}}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_inputs_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_inputs_wrapper{margin:20px 0;}@media (min-width:1px) and (max-width:450px),(min-height:1px) and (max-height:450px){#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_inputs_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_inputs_wrapper{margin:8px 0;}}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_field_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_field_wrapper{border-radius:5px;}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_field_inner,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_field_inner{margin:0 10px;}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_submit_button_wrapper,#fca_eoi_form_400 form.fca_eoi_layout_5 div.fca_eoi_layout_submit_button_wrapper{border-bottom:solid 4px transparent;border-radius:5px;padding:0 !important;text-align:center;width:100%;}#fca_eoi_form_400 div.fca_eoi_layout_5 div.fca_eoi_layout_submit_button_wrapper input,#fca_eoi_form_400 form.fca_eoi_layout_5 
div.fca_eoi_layout_submit_button_wrapper input{border:0 !important;border-radius:5px;font-weight:bold;margin:0;height:2.8em;padding:0;text-shadow:0 0 2px black;white-space:normal;width:100%;} #fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox{background-color:#f6f6f6 !important;border-color:#ccc !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_headline_copy_wrapper div{font-size:28px !important;color:#1a78d7 !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_name_field_wrapper,#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_name_field_wrapper input{font-size:18px !important;color:#777 !important;background-color:#fff !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_name_field_wrapper{border-color:#ccc !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_email_field_wrapper,#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_email_field_wrapper input{font-size:18px !important;color:#777 !important;background-color:#fff !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_email_field_wrapper{border-color:#ccc !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_submit_button_wrapper input{font-size:18px !important;color:#fff !important;background-color:#e67e22 !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_submit_button_wrapper input:hover{background-color:#d35300 !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_submit_button_wrapper{background-color:#d35300 !important;border-color:#d35300 !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_privacy_copy_wrapper div{font-size:14px !important;color:#8f8f8f !important;}#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_fatcatapps_link_wrapper a,#fca_eoi_form_400 .fca_eoi_layout_5.fca_eoi_layout_postbox div.fca_eoi_layout_fatcatapps_link_wrapper a:hover{color:#8f8f8f !important;}

jQuery.post( "http://www.arilamstein.com/", { 'fca_eoi_track_form_id': 400 } );

LEARN TO MAP CENSUS DATA Subscribe and get my free email course: Mapping Census Data in R!

100% Privacy. We don’t spam. Powered by Optin Cat

if ( typeof fca_eoi === "undefined" ) {fca_eoi = {};}fca_eoi["field_required"] = "Error: This field is required.";fca_eoi["invalid_email"] = "Error: Please enter a valid email address. For example "max@domain.com".";

The post Administrative Maps and Projections in R appeared first on AriLamstein.com.

To leave a comment for the author, please follow the link and comment on his blog: AriLamstein.com » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Waterfall plots – what and how?

Thu, 2015-07-23 10:09

(This article was first published on Design Data Decisions » R, and kindly contributed to R-bloggers)

“Waterfall plots” are nowadays often used in oncology clinical trials for a graphical representation of the quantitative response of each subject to treatment. For an informative article explaining waterfall plots see Understanding Waterfall Plots.

In this post, we illustrate the creation of waterfall plots in R.

In a typical waterfall plot, the x-axis serves as the baseline (zero change from baseline) for the response variable. For each subject, a vertical bar is drawn from this baseline, in either the positive or the negative direction, to depict the subject's change from baseline in the response. The y-axis thus represents the change from baseline in the response, usually expressed as a percentage, e.g., percent change in the size of the tumor or percent change in some marker level. Most importantly, in a waterfall plot, the bars are ordered in decreasing order of the percent change values.

Though waterfall plots have gained popularity in oncology, they can be used for data visualization in other clinical trials as well, where the response is expressed as a change from baseline.

Dataset:

Instead of a tumor growth dataset, we illustrate the creation of waterfall plots for the visual depiction of quality of life data. A quality of life dataset, dataqol2, is available with the R package QoLR.

require(QoLR)
?dataqol2
data(dataqol2)
head(dataqol2)
dataqol2$id <- as.factor(dataqol2$id)
dataqol2$time <- as.factor(dataqol2$time)
dataqol2$arm <- as.factor(dataqol2$arm)

dataqol2 contains longitudinal data on scores for 2 quality of life measures (QoL and pain) for 60 subjects. In the case of QoL, higher scores are better since they imply better quality of life, and for pain, lower scores are better since they imply a decrease in pain. Each subject has these scores recorded at baseline (time = 0) and then at a maximum of 5 more time points post baseline. ‘arm’ represents the treatment arm to which the subjects were assigned. The dataset is in long format.

The rest of this post is on creating a waterfall plot in R for the QoL response variable.

Creating a waterfall plot using the barplot function in base R

The waterfall plot is basically an ‘ordered bar chart’, where each bar represents the change from baseline response measure for the corresponding subject.
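
As a toy illustration of this ordering idea (the numbers below are made up), the essence is just a sorted bar chart:

pc <- c(35, -10, 80, 5, -45, 20)   # made-up percent changes from baseline
barplot(sort(pc, decreasing = TRUE), col = "grey40", border = NA,
        ylab = "Change from baseline (%)")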

As a first step, it helps to change the format of the dataset from ‘long’ to ‘wide’. We use the reshape function to do this. Also, we retain only the QoL scores, not the pain scores:

qol2.wide <- reshape(dataqol2, v.names = "QoL", idvar = "id",
                     timevar = "time", direction = "wide",
                     drop = c("date", "pain"))

For each subject, we then find the best (largest) QoL score value post baseline, compute the best percentage change from baseline and order the dataframe in the decreasing order of the best percentage changes. We also remove subjects with missing percent change values:

qol2.wide$bestQoL <- apply(qol2.wide[, 5:9], 1,
                           function(x) ifelse(sum(!is.na(x)) == 0, NA, max(x, na.rm = TRUE)))
qol2.wide$bestQoL.PerChb <- ((qol2.wide$bestQoL - qol2.wide$QoL.0) / qol2.wide$QoL.0) * 100
o <- order(qol2.wide$bestQoL.PerChb, decreasing = TRUE, na.last = NA)
qol2.wide <- qol2.wide[o, ]

Create the waterfall plot… Finally!

barplot(qol2.wide$bestQoL.PerChb, col = "blue", border = "blue", space = 0.5,
        ylim = c(-100, 100),
        main = "Waterfall plot for changes in QoL scores",
        ylab = "Change from baseline (%) in QoL score",
        cex.axis = 1.2, cex.lab = 1.4)

Since we are depicting changes in quality of life scores, the higher the bar is in the positive direction, the better the improvement in the quality of life. So, the above figure shows that, for most subjects, there was improvement in the quality of life post baseline.

We can also color the bars differently by treatment arm, and include a legend. I used the choose_palette() function from the excellent colorspace R package to get some nice colors.

col <- ifelse(qol2.wide$arm == 0, "#BC5A42", "#009296")
barplot(qol2.wide$bestQoL.PerChb, col = col, border = col, space = 0.5,
        ylim = c(-100, 100),
        main = "Waterfall plot for changes in QoL scores",
        ylab = "Change from baseline (%) in QoL score",
        cex.axis = 1.2, cex.lab = 1.4,
        legend.text = c(0, 1),
        args.legend = list(title = "Treatment arm",
                           fill = c("#BC5A42", "#009296"),
                           border = NA, cex = 0.9))

Treatment arm 1 is associated with the largest post baseline increases in the quality of life score. Since waterfall plots are basically bar charts, they can be colored by other relevant subject attributes as well.

The above is a solution to creating waterfall plots using the base R graphics function barplot. My aim is to also develop a solution using the ggplot2 package (and, in the process, build expertise in ggplot2). So here it is…

Creating a waterfall plot using ggplot2

We use the previously created qol2.wide dataframe, but in ggplot2, we also need an x variable. So:

require(ggplot2)
x <- 1:nrow(qol2.wide)

Next we specify some plot settings. We color the bars by treatment arm and keep the default colors of ggplot2, since I think they are quite nice. We also remove the x-axis and put sensible limits on the y-axis:

b <- ggplot(qol2.wide, aes(x = x, y = bestQoL.PerChb, fill = arm, color = arm)) +
  scale_fill_discrete(name = "Treatment\narm") +
  scale_color_discrete(guide = "none") +
  labs(list(title = "Waterfall plot for changes in QoL scores",
            x = NULL, y = "Change from baseline (%) in QoL score")) +
  theme_classic() %+replace%
  theme(axis.line.x = element_blank(), axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.title.y = element_text(face = "bold", angle = 90)) +
  coord_cartesian(ylim = c(-100, 100))

Finally, the actual bars are drawn using geom_bar(), where we specify the width of the bars and the space between bars. We specify stat="identity" because we want the heights of the bars to represent actual values in the data. See ?geom_bar.

b <- b + geom_bar(stat = "identity", width = 0.7,
                  position = position_dodge(width = 0.4))
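
Note that assigning the result to b does not draw anything by itself; the ggplot object has to be printed to render the plot:

b   # at the console; inside a script or function, use print(b)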

To leave a comment for the author, please follow the link and comment on his blog: Design Data Decisions » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Call for participation: AusDM 2015, Sydney, 8-9 August

Thu, 2015-07-23 08:21

(This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers)

*************************************************************
The 13th Australasian Data Mining Conference (AusDM 2015)
Sydney, Australia, 8–9 August 2015
URL: http://ausdm15.ausdm.org/
*************************************************************

The Australasian Data Mining Conference is devoted to the art and science of intelligent data mining: the meaningful analysis of (usually large) data sets to discover relationships and present the data in novel ways that are compact, comprehensible and useful for researchers and practitioners.

This conference will bring together researchers and practitioners from the Data Mining and Business Analytics community to share and learn about research and progress in the local context, as well as new breakthroughs in data mining algorithms and their applications.

Keynotes
========

Discovering Negative Links on Social Networking Sites
Prof Huan Liu, Arizona State University

Large Scale Metric Learning using Locality Sensitive Hashing
Prof Ramamohanarao Kotagiri, University of Melbourne

Big Data for Everyone
Prof Jian Pei, Simon Fraser University

Big Data Mining and Data Science
Prof Yong Shi, Chinese Academy of Sciences

Deep Broad Learning – Big Models for Big Data
Prof Geoff Webb, Monash University

Algorithm acceleration for high throughput biology
Prof Wei Wang, University of California, Los Angeles

Big Data Analytics in Business Environments
Prof Hui Xiong, State University of New Jersey

On Mining Heterogeneous Information Networks
Prof Phillip Yu, University of Illinois at Chicago

Resource Management in Cloud Computing Systems
Prof Albert Zomaya, University of Sydney

Big Data Algorithms and Clinical Applications
A/Prof Yixin Chen, Washington University

Defining Data Science
Prof Yangyong Zhu, Fudan University

Learning with Big Data by Incremental Optimization of Performance Measures
Prof Zhi-Hua Zhou, Nanjing University

Accepted Papers
=================

Research Track:

FSMEC: A Feature Selection Method based on the Minimum Spanning Tree and Evolutionary Computation
Amer Abu Zaher, Regina Berretta, Ahmed Shamsul Arefin and Pablo Moscato

Mining Productive Emerging Patterns and Their Application in Trend Prediction
Vincent Mwintieru Nofong

Detection of Structural Changes in Data Streams
Ross Callister, Mihai Lazarescu and Duc-Son Pham

Multiple Imputation on Partitioned Datasets
Michael Furner and Md Zahidul Islam

Particle Swarm Optimisation for Feature Selection: A Size-Controlled Approach
Bing Xue and Mengjie Zhang

Complement Random Forest
Md Nasim Adnan and Zahid Islam

Aspect-Based Opinion Mining from Product Reviews Using Conditional Random Fields
Amani Samha, Yuefeng Li and Jinglan Zhang

On Ranking Nodes using kNN Graphs, Shortest-paths and GPUs
Ahmed Shamsul Arefin, Regina Berretta and Pablo Moscato

Link Prediction and Topological Feature Importance in Social Networks
Stephan Curiskis, Thomas Osborn and Paul Kennedy

AWST: A Novel Attribute Weight Selection Technique for Data Clustering
Md Anisur Rahman and Md Zahidul Islam

Genetic Programming Using Two Blocks To Extract Edge Features
Wenlong Fu, Mengjie Zhang and Mark Johnston

Designing a knowledge-based schema matching system for schema mapping
Sarawat Anam and Byeong Ho Kang

A Differentially Private Decision Forest
Sam Fletcher and Md Zahidul Islam

Industry Track:

Improving Bridge Deterioration Modelling Using Rainfall Data from the Bureau of Meteorology
Qing Huang, Kok-Leong Ong and Damminda Alahakoon

An Industrial Application of Rotation Forest: Transformer Health Diagnosis
Tamilalagan Natarajan, Duc-Son Pham and Mihai Lazarescu

Non-Invasive Attributes Significance in the Risk Evaluation of Heart Disease Using Decision Tree Analysis
Mai Shouman and Tim Turner

An Improved SMO Algorithm for Credit Risk Evaluation
Jue Wang, Aiguo Lu and Xuemei Jiang

Join us on LinkedIn:
http://www.linkedin.com/groups/AusDM-4907891

To leave a comment for the author, please follow the link and comment on his blog: blog.RDataMining.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

htmltab v.0.6.0

Thu, 2015-07-23 03:50

(This article was first published on Automated Data Collection with R Blog - rstats, and kindly contributed to R-bloggers)

The next version of the htmltab package has just been released on CRAN and GitHub. The goal behind htmltab is to make the collection of structured information from HTML tables as easy and painless as possible (read about the package here and here). The most recent update fixes many smaller bugs and inconsistencies, and brings significant internal optimization of the code to increase not only the robustness of the function but also the level of verbosity in case something goes wrong. A complete list of the changes can be found here.

install.packages("htmltab")
# or
devtools::install_github("crubba/htmltab")

Header information that appears in the body

With v.0.6.0, a new feature has been introduced that allows users to process header information that appears somewhere in the table body. This is not an uncommon design choice, and the question of how such tables can be processed with R has been debated on Stack Overflow. I illustrate this problem with a table from the American National Weather Service. Below is a crop of this table, which should give you the basic idea:

The task is to assemble a table where the model type information that appears in the body (global models, regional models, …) populates a separate column in the final table. To this end, htmltab() has been extended to accept in its header argument a formula-like expression to signify the different dimensions of header information. The basic format of this formula interface is level1 + level2 + level3 + …, where you can express the position of each element either numerically or with a character vector holding an XPath expression that identifies the respective element. So, for the table above, we pass 1 + "//tr/td[@colspan = '7']", which expresses that the first-level header appears in row 1 and the second-level headers appear in cells which have a colspan attribute of 7:

nomads <- htmltab(doc = "http://christianrubba.com/htmltab/ex/nomads.html",
                  which = 1, header = 1 + "//tr/td[@colspan = '7']")
## Warning: The code for the HTML table you provided contains invalid table
## tags ('//trbody'). The following transformations were applied:
##
## //trbody -> //tbody
##
## If you specified an XPath that makes a reference to this tag, this may
## have caused problems with their identification.
## Warning: The code for the HTML table you provided is malformed. Not all
## cells are nested in row tags (<tr>). htmltab tried to normalize the table
## and ensure that all cells are within row tags. If you specified an XPath
## for body or header elements, this may have caused problems with their
## identification.
head(nomads, 22)
##           Header_1                                     Data Set      freq
## 3    Global Models                                         GDAS   6 hours
## 4    Global Models                              GFS 0.25 Degree   6 hours
## 5    Global Models            GFS 0.25 Degree (Secondary Parms)   6 hours
## 6    Global Models                              GFS 0.50 Degree   6 hours
## 7    Global Models                              GFS 1.00 Degree   6 hours
## 8    Global Models                 GFS Ensemble high resolution   6 hours
## 9    Global Models           GFS Ensemble Precip Bias-Corrected     daily
## 10   Global Models  GFS Ensemble high-resolution Bias-Corrected   6 hours
## 11   Global Models  GFS Ensemble NDGD resolution Bias-Corrected   6 hours
## 12   Global Models          NAEFS high resolution Bias-Corrected  6 hours
## 13   Global Models          NAEFS NDGD resolution Bias-Corrected  6 hours
## 14   Global Models                              NGAC 2D Products    daily
## 15   Global Models                              NGAC 3D Products    daily
## 16   Global Models           NGAC Aerosol Optical Depth Products    daily
## 17   Global Models         Climate Forecast System Flux Products  6 hours
## 18   Global Models  Climate Forecast System 3D Pressure Products  6 hours
## 20 Regional Models                             AQM Daily Maximum 06Z, 12Z
## 21 Regional Models                      AQM Hourly Surface Ozone 06Z, 12Z
## 22 Regional Models                                  HIRES Alaska    daily
## 23 Regional Models                                   HIRES CONUS 12 hours
## 24 Regional Models                                    HIRES Guam 12 hours
## 25 Regional Models                                  HIRES Hawaii 12 hours
##    grib filter http     gds-alt
## 3  grib filter http OpenDAP-alt
## 4  grib filter http OpenDAP-alt
## 5  grib filter http        <NA>
## 6  grib filter http OpenDAP-alt
## 7  grib filter http OpenDAP-alt
## 8  grib filter http OpenDAP-alt
## 9  grib filter http OpenDAP-alt
## 10 grib filter http OpenDAP-alt
## 11 grib filter http OpenDAP-alt
## 12 grib filter http OpenDAP-alt
## 13 grib filter http OpenDAP-alt
## 14 grib filter http           -
## 15 grib filter http           -
## 16 grib filter http           -
## 17 grib filter http           -
## 18 grib filter http           -
## 20 grib filter http OpenDAP-alt
## 21 grib filter http OpenDAP-alt
## 22 grib filter http OpenDAP-alt
## 23 grib filter http OpenDAP-alt
## 24 grib filter http OpenDAP-alt
## 25 grib filter http OpenDAP-alt

In the data frame that was produced, we see that the model labels that appeared throughout the body now populate a separate column. Such a format is almost always more useful for further exploration of the data frame, either through summary statistics or visualizations. More generally, there is no limit on the level of nestedness that the table may exhibit. The only requirements for this feature to work are that the header levels must be strictly nested and that you can specify the exact position of the elements through a numeric value or an XPath.
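
As a hypothetical illustration (the URL, the class attribute and its value below are made up), a table with three strictly nested header levels would simply extend the formula by one more term:

htmltab(doc = "http://example.com/nested-table.html", which = 1,
        header = 1 + "//tr/td[@colspan = '7']" + "//tr/td[@class = 'subgroup']")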

For more information have a look at the package vignette. And as always, I am happy to hear about any problems you experience with the package.

To leave a comment for the author, please follow the link and comment on his blog: Automated Data Collection with R Blog - rstats. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Stan 2.7 (CRAN, variational inference, and much much more)

Wed, 2015-07-22 18:23

(This article was first published on Statistical Modeling, Causal Inference, and Social Science » R, and kindly contributed to R-bloggers)

Stan 2.7 is now available for all interfaces. As usual, everything you need can be found starting from the Stan home page:

Highlights
  • RStan is on CRAN!(1)
  • Variational Inference in CmdStan!!(2)
  • Two new Stan developers!!! 
  • A whole new logo!!!! 
  • Math library with autodiff now available in its own repo!!!!! 
(1) Just doing install.packages("rstan") isn't going to work because of dependencies; please go to the RStan getting started page for instructions on how to install from CRAN. It's much faster than building from source and you no longer need a machine with a lot of RAM to install.

(2) Coming soon to an interface near you.

Full Release Notes

v2.7.0 (9 July 2015)
======================================================================

New Team Members
--------------------------------------------------
* Alp Kucukelbir, who brings you variational inference
* Robert L. Grant, who brings you the StataStan interface

Major New Feature
--------------------------------------------------
* Black-box variational inference, mean field and full rank (#1505)

New Features
--------------------------------------------------
* Line numbers reported for runtime errors (#1195)
* Wiener first passage time density (#765) (thanks to Michael Schvartsman)
* Partial initialization (#1069)
* NegBinomial2 RNG (#1471) and PoissonLog RNG (#1458) and extended range for
  Dirichlet RNG (#1474) and fixed Poisson RNG for older Mac compilers (#1472)
* Error messages now use operator notation (#1401)
* More specific error messages for illegal assignments (#1100)
* More specific error messages for illegal sampling statement signatures (#1425)
* Extended range on ibeta derivatives with wide impact on CDFs (#1426)
* Display initialization error messages (#1403)
* Works with Intel compilers and GCC 4.4 (#1506, #1514, #1519)

Bug Fixes
--------------------------------------------------
* Allow functions ending in _lp to call functions ending in _lp (#1500)
* Update warnings to catch uses of illegal sampling functions like CDFs and
  updated declared signatures (#1152)
* Disallow constraints on local variables (#1295)
* Allow min() and max() in variable declaration bounds and remove unnecessary
  use of math.h and top-level :: namespace (#1436)
* Updated exponential lower bound check (#1179)
* Extended sum to work with zero size arrays (#1443)
* Positive definiteness checks fixed (were > 1e-8, now > 0) (#1386)

Code Reorganization and Back End Upgrades
--------------------------------------------------
* New static constants (#469, #765)
* Added major/minor/patch versions as properties (#1383)
* Pulled all math-like functionality into stan::math namespace
* Pulled the Stan Math Library out into its own repository (#1520)
* Included in Stan C++ repository as submodule
* Removed final usage of std::cout and std::cerr (#699) and updated tests
  for null streams (#1239)
* Removed over 1000 CppLint warnings
* Remove model write CSV methods (#445)
* Reduced generality of operators in fvar (#1198)
* Removed folder-level includes due to order issues (part of Math reorg)
  and include math.hpp include (#1438)
* Updated to Boost 1.58 (#1457)
* Travis continuous integration for Linux (#607)
* Add grad() method to math::var for autodiff to encapsulate math::vari
* Added finite diff functionals for testing (#1271)
* More configurable distribution unit tests (#1268)
* Clean up directory-level includes (#1511)
* Removed all lint from new math lib and add cpplint to build lib (#1412)
* Split out derivative functionals (#1389)

Manual and Documentation
--------------------------------------------------
* New Logo in Manual; remove old logos (#1023)
* Corrected all known bug reports and typos; details in issues #1420, #1508, #1496
* Thanks to Sunil Nandihalli, Andy Choi, Sebastian Weber, Heraa Hu,
  @jonathan-g (GitHub handle), M. B. Joseph, Damjan Vukcevic,
  @tosh1ki (GitHub handle), Juan S. Casallas
* Fix some parsing issues for index (#1498)
* Added chapter on variational inference
* Added strangely unrelated regressions and multivariate probit examples
* Discussion from Ben Goodrich about reject() and sampling
* Start to reorganize code with fast examples first, then explanations
* Added CONTRIBUTING.md file (#1408)

The post Stan 2.7 (CRAN, variational inference, and much much more) appeared first on Statistical Modeling, Causal Inference, and Social Science.

To leave a comment for the author, please follow the link and comment on his blog: Statistical Modeling, Causal Inference, and Social Science » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Le Monde puzzle [#920]

Wed, 2015-07-22 18:15

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

A puzzling Le Monde mathematical puzzle (or blame the heat wave):

A pocket calculator with ten keys (0,1,…,9) starts with a random digit n between 0 and 9. A number on the screen can then be modified into another number by two rules:
1. pressing k changes the k-th digit v whenever it exists into (v+1)(v+2) where addition is modulo 10;
2. pressing 0k deletes the (k-1)th and (k+1)th digits if they both exist and are identical (otherwise nothing happens).
Which 9-digit numbers can always be produced whatever the initial digit?
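
To make rule 1 concrete, here is a minimal sketch of my own (the helper name and the convention that the new pair replaces the digit in place are my reading of the rule, not part of the puzzle statement):

press_key <- function(x, k) {  # x is the number as a vector of digits
  v <- x[k]
  c(x[seq_len(k - 1)], (v + 1) %% 10, (v + 2) %% 10, x[seq_len(length(x) - k) + k])
}
press_key(c(4), 1)     # 4 becomes 56
press_key(c(5, 6), 2)  # 56 becomes 578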

I did not find an easy entry to this puzzle, in particular because it did not state what to do once 9 digits had been reached: would the extra digits disappear? But then, those to the left or to the right? The description also fails to explain how to handle n=000 000 004 versus n=4.

Instead, I tried to look at the numbers with fewer than 7 digits that could appear, using some extra rules of my own, like preventing numbers with more than 9 digits. These rules resulted in a sure stopping rule when applying both rules above at random:

leplein = rep(0, 1e6)
for (v in 1:1e6) {
  x = as.vector(sample(1:9, 1))
  for (t in 1:1e5) {
    k = length(x) # as sequence of digits
    if (k < 3) {
      i = sample(rep(1:k, 2), 1)
      x[i] = (x[i] + 1) %% 10
      y = c(x[1:i], (x[i] + 1) %% 10)
      if (i < k) { x = c(y, x[(i + 1):k]) } else { x = y }
    } else {
      prop1 = prop2 = NULL
      difs = (2:(k - 1))[abs(x[-(1:2)] - x[-((k - 1):k)]) == 0]
      if (length(difs) > 0) prop1 = sample(rep(difs, 2), 1)
      if (k < 9) prop2 = sample(rep(1:k, 2), 1)
      if (length(c(prop1, prop2)) > 1) {
        if (runif(1) < .5) {
          x[prop2] = (x[prop2] + 1) %% 10
          y = c(x[1:prop2], (x[prop2] + 1) %% 10)
          if (prop2 < k) { x = c(y, x[(prop2 + 1):k]) } else { x = y }
        } else {
          x = x[-c(prop1 - 1, prop1 + 1)]
        }
        while ((length(x) > 1) & (x[1] == 0)) x = x[-1]
      }
      if (length(c(prop1, prop2)) == 1) {
        if (is.null(prop2)) {
          x = x[-c(prop1 - 1, prop1 + 1)]
        } else {
          x[prop2] = (x[prop2] + 1) %% 10
          y = c(x[1:prop2], (x[prop2] + 1) %% 10)
          if (prop2 < k) { x = c(y, x[(prop2 + 1):k]) } else { x = y }
          x = c(x[1:(prop2 - 1)], (x[prop2] + 1) %% 10,
                (x[prop2] + 2) %% 10, x[(prop2 + 1):k])
        }
        while ((length(x) > 1) & (x[1] == 0)) x = x[-1]
      }
      if (length(c(prop1, prop2)) == 0) break()
    }
    k = length(x)
    if (k < 7) leplein[sum(x * 10^((k - 1):0))] = leplein[sum(x * 10^((k - 1):0))] + 1
  }
}

code that fills an occupancy table for the numbers less than a million over 10⁶ iterations. The solution as shown below (with the number of zero entries over each column) is rather surprising in that it shows an occupancy that is quite regular over a grid. While it does not answer the original question…

Filed under: Books, Kids, R, Statistics, University life Tagged: Le Monde, mathematical puzzle

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Count data: To Log or Not To Log

Wed, 2015-07-22 10:45

(This article was first published on biologyforfun » R, and kindly contributed to R-bloggers)

Count data are widely collected in ecology, for example when one counts the number of birds or the number of flowers. These data naturally follow a Poisson or negative binomial distribution and are therefore sometimes tricky to fit with standard LMs. A traditional approach has been to log-transform such data and then fit LMs to the transformed data. Recently a paper advocated against the use of such transformations, since they led to high bias in the estimated coefficients. More recently another paper argued that log-transformation of count data followed by an LM led to a lower type I error rate (i.e., saying that an effect is significant when it is not) than GLMs. What should we do then?

Using a slightly changed version of the code published in the Ives 2015 paper, let's explore the impact of these different modelling strategies on the rejection of the null hypothesis and on the bias of the estimated coefficients.

# load the libraries
library(MASS)
library(lme4)
library(ggplot2)
library(RCurl)
library(plyr)
# download and load the functions that will be used
URL <- "https://raw.githubusercontent.com/Lionel68/Jena_Exp/master/stats/ToLog_function.R"
download.file(URL, destfile = paste0(getwd(), "/ToLog_function.R"), method = "curl")
source("ToLog_function.R")

Following the paper from Ives, and the code therein, I simulated a predictor (x) and a response (y) that follows a negative binomial distribution and is linearly related to x.
In the first case I look at the impact of varying the sample size on the rejection of the null hypothesis and on the bias in the estimated coefficient between y and x.
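
To fix ideas, here is a minimal sketch of what a single simulation draw looks like (my own illustration; the compute.stats() function sourced above wraps many replicates of something like this, and the +1 offset in the log-transform is one of several assumed constants):

library(MASS)
set.seed(1)
n  <- 100
x  <- rnorm(n)
b1 <- 0.5
y  <- rnbinom(n, mu = exp(1 + b1 * x), size = 1)  # NB response, log link
m_nb  <- glm.nb(y ~ x)          # GLM: negative binomial
m_log <- lm(log(y + 1) ~ x)     # LM on log-transformed counts ("Log1")
coef(m_nb)["x"]                 # compare the slope estimates
coef(m_log)["x"]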

###### univariate NB case ############
base_theme <- theme(title = element_text(size = 20), text = element_text(size = 18))
# range over n
output <- compute.stats(NRep = 500, n.range = c(10, 20, 40, 80, 160, 320, 640, 1280))
ggplot(output, aes(x = n, y = Reject, color = Model)) + geom_path(size = 2) +
  scale_x_log10() +
  labs(x = "Sample Size (log-scale)", y = "Proportion of Rejected H0") + base_theme
ggplot(output, aes(x = n, y = Bias, color = Model)) + geom_path(size = 2) +
  scale_x_log10() +
  labs(x = "Sample Size (log-scale)", y = "Bias in the slope coefficient") + base_theme

For this simulation round the coefficient of the slope (b1) was set to 0 (no effect of x on y), and only the sample size varied. The top panel shows the average proportion of times that the p-value of the slope coefficient was lower than 0.05 (H0: b1=0 rejected). We see that for low sample sizes (<40) the Negative Binomial model has a higher proportion of rejected H0 (type I error rate), but this difference between the models disappears once we reach sample sizes bigger than 100. The bottom panel shows the bias (estimated value – true value) in the estimated coefficient. For very low sample sizes (n=10), Log001, Negative Binomial and Quasipoisson have higher bias than Log1 and LogHalf. For larger sample sizes the difference between the GLM team (NB and QuasiP) and the LM one (Log1 and LogHalf) gradually decreases, and both teams converge to a bias around 0. Only Log0001 behaves very badly. From what we saw here, Log1 and LogHalf seem to be good choices for count data: they have low Type I error and low bias along the whole sample size gradient.

The issue is that an effect of exactly 0 never exists in real life, where most effects are small (but non-zero); thus the Null Hypothesis will never be true. Let's look now at how the different models behave when we vary b1, first alone and then crossed with sample size variation.

# range over b1
outpu <- compute.stats(NRep = 500, b1.range = seq(-2, 2, length = 17))
ggplot(outpu, aes(x = b1, y = Reject, color = Model)) + geom_path(size = 2) +
  base_theme + labs(y = "Proportion of Rejected H0")
ggplot(outpu, aes(x = b1, y = Bias, color = Model)) + geom_path(size = 2) +
  base_theme + labs(y = "Bias in the slope coefficient")

Here the sample size was set to 100. What we see in the top graph is that for a slope of exactly 0 all models have a similar average proportion of rejection of the null hypothesis. As b1 becomes smaller or bigger, the average proportion of rejection shows a very similar increase for all models except Log0001, which increases more slowly. These curves basically represent the power of the models to detect an effect, and they are very similar to Fig. 2 in the Ives 2015 paper. Now the bottom panel shows that all the LM models behave badly in terms of bias: they have only small bias for very small (close to 0) coefficients, and as the coefficient gets bigger the absolute bias increases. This means that even if the LM models are able to detect an effect with similar power, the estimated coefficient is wrong. This is due to the value added to the untransformed count data in order to avoid -Inf for 0s. I have no idea how one may arithmetically take these added values into account and then remove their effects …

Next let’s cross variation in the coefficient with sample size variation:

# range over n and b1
output3 <- compute.stats(NRep = 500, b1.range = seq(-1.5, 1.5, length = 9),
                         n.range = c(10, 20, 40, 80, 160, 320, 640, 1280))
ggplot(output3, aes(x = n, y = Reject, color = Model)) + geom_path(size = 2) +
  scale_x_log10() + facet_wrap(~b1) + base_theme +
  labs(x = "Sample size (log-scale)", y = "Proportion of rejected H0")
ggplot(output3, aes(x = n, y = Bias, color = Model)) + geom_path(size = 2) +
  scale_x_log10() + facet_wrap(~b1) + base_theme +
  labs(x = "Sample size (log-scale)", y = "Bias in the slope coefficient")

The top panel shows one big issue with focusing only on the significance level: rejection of H0 depends not only on the size of the effect but also on the sample size. For example, for b1=0.75 (a rather large value, since we work on the exponential scale) fewer than 50% of all models rejected the null hypothesis for a sample size of 10. Of course, as the effect sizes get larger, the impact of the sample size on the rejection of the null hypothesis is reduced. However, most effects around the world are small, so we need big sample sizes to be able to “detect” them using null hypothesis testing. The top graph also shows that NB is slightly better than the other models and that Log0001 again has the worst performance. The bottom graphs show something interesting: the bias is quasi-constant over the sample size gradient (maybe if we looked closer we would see some variation). Irrespective of how many data points you collect, the LMs will always have bigger bias than the GLMs (except for the artificial case of b1=0).

To finish with: in Ives 2015 the big surprise was the explosion of type I error with increasing variation in individual-level random error (adding a random, normally distributed value to the linear predictor of each data point and varying the standard deviation of these random values), as can be seen in Fig. 3 of the paper.
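
Here is a minimal, self-contained sketch of that setup (my own illustration; compute.statsGLMM() below wraps the full simulation, and the observation-level random effect in a Poisson GLMM is one standard way to absorb this kind of overdispersion):

library(lme4)
set.seed(1)
n   <- 100
x   <- rnorm(n)
eps <- rnorm(n, sd = 1)                           # individual-level random error
y   <- rpois(n, lambda = exp(1 + 0.5 * x + eps))  # error enters the linear predictor
obs <- factor(seq_len(n))                         # one level per observation
m_glmm <- glmer(y ~ x + (1 | obs), family = poisson)
fixef(m_glmm)["x"]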

# range over b1 and sd.eps
output4 <- compute.statsGLMM(NRep = 500, b1.range = seq(-1.5, 1.5, length = 9),
                             sd.eps.range = seq(0.01, 2, length = 10))
ggplot(output4, aes(x = sd.eps, y = Reject, color = Model)) + geom_path(size = 2) +
  facet_wrap(~b1) + base_theme +
  labs(x = "", y = "Proportion of rejected H0")
ggplot(output4, aes(x = sd.eps, y = Bias, color = Model)) + geom_path(size = 2) +
  facet_wrap(~b1) + base_theme +
  labs(x = "Standard Deviation of the random error", y = "Bias of the slope coefficient")

Before looking at the figure in detail, please note that a standard deviation of 2 in this context is very high; remember that these values are added to the linear predictor, which is exponentiated, so we end up with very large deviations. In the top panel there are two surprising results: the sign of the coefficient affects the pattern of null hypothesis rejection, and I do not see the explosion of rejection rates for NB or QuasiP that is presented in Ives 2015. In his paper Ives reported the LRT test for the NB models, while I report the p-values from the model summary directly (Wald test). If some people around have computing power, it would be interesting to see whether changing the seed and/or increasing the number of replications leads to different patterns … The bottom panel shows again that the LMs' bias is big; the NB and QuasiP models show an increase in bias with the standard deviation, but only if the coefficient is negative (I suspect some issue with exponentiating large positive random errors); as expected, the GLMM performs best in this context.

Pre-conclusion: in real life, of course, we would rarely have a model with only one predictor; most of the time we will build larger models with complex interaction structures between the predictors. This will of course affect both H0 rejection and bias, but that is material for a future post.

Let’s wrap it up: we’ve seen that even if LM transformation seems to be a good choice for obtaining a lower type I error rate than GLMs, this advantage will be rather minimal when using empirical data (no effects are exactly 0) and potentially dangerous (large bias). Ecologists sometimes have the bad habit of turning their analysis into a star hunt (R's standard model output gives stars to significant effects), and focusing only on models that behave best with respect to significance (but have large bias) does not seem to be a good strategy to me. More and more people call for increasing the predictive power of ecological models; we then need modelling techniques that are able to estimate the effects precisely (with low bias). In this context, transforming the data to make them somehow fit normality assumptions is sub-optimal; it is much more natural to think about what type of process generated the data (normal, Poisson, negative binomial, with or without hierarchical structure) and then model it accordingly. There are extensive discussions nowadays about the use and abuse of p-values in science, and I think that in our analyses we should slowly but surely shift our focus from “significance / p-values < 0.05 / star hunting” alone to a more balanced mix of effect sizes (or standardized slopes), p-values and R-squared.

Filed under: Biological Stuff, R and Stat Tagged: ecology, GLM, LM, R, Statistics

To leave a comment for the author, please follow the link and comment on his blog: biologyforfun » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

New R Markdown articles section, plus .Rmd to .docx super powers!

Wed, 2015-07-22 10:45

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’ve added a new articles section to the R Markdown development center at rmarkdown.rstudio.com/articles.html. Here you can find expert advice and tips on how to use R Markdown efficiently.

In one of the first articles, Richard Layton of graphdoctor.com explains the best tips for using R Markdown to generate Microsoft Word documents. You’ll learn how to

  • set Word styles
  • tweak margins
  • handle relative paths
  • make better tables
  • add bibliographies, and more

Check it out; and then check back often.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

New caret Version (6.0-52)

Wed, 2015-07-22 10:42

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

A new version of caret (6.0-52) is on CRAN.

Here is the news file but the Cliff Notes are:

  • sub-sampling for class imbalances is now integrated with train and is used inside of standard resampling. There are four methods available right now: up- and down-sampling, SMOTE, and ROSE. The help page has detailed information; a minimal sketch follows this list.
  • Nine additional models were added, bringing the total up to 192.
  • More error traps were added for common mistakes (e.g. bad factor levels in classification).
  • Various bug fixes and snake enhancements
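
As a minimal sketch of the new sub-sampling integration (assuming the sampling argument of trainControl accepts the method names listed above; dat and Class are placeholder data):

library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                     sampling = "down")  # down-sample inside each resample
# fit <- train(Class ~ ., data = dat, method = "rf", trControl = ctrl)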

On-deck for upcoming versions:

  • An expanded interface for preprocessing. You might want to process some predictors one way and others differently (or not at all). A new interface will allow for this but should maintain backwards compatibility (I hope)
  • Censored data models. Right now we are spec’ing out how this will work but the plan is to have models for predicting the outcome directly as well as models that predict survivor function probabilities. Email me (max.kuhn@pfizer.com) if you have an interest in this.
  • Enabling prediction intervals (for models that support this) using predict.train. To be clear, caret isn’t generating these but if you fit an lm model, you will be able to generate intervals from predict.lm and pass them through predict.train.

Max

To leave a comment for the author, please follow the link and comment on his blog: Blog - Applied Predictive Modeling. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs