R bloggers

R news and tutorials contributed by 573 R bloggers

Win a FREE ticket to EARL London!

Mon, 2015-08-17 12:35

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Competition Time!

Mango Solutions are offering the opportunity to win a FREE two-day Conference Pass* to the EARL (Effective Application of the R Language) Conference in London on 14th-16th September.

*Value £545

To Enter:

Email earl-team@mango-solutions.com with your name and a one sentence reply to the following question:

Why I use (or am considering using) R?

Competition Deadline: Midnight, Sunday 23rd August

For full details of the EARL Conference click here

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.

Playing with R, Shiny Dashboard and Google Analytics Data

Mon, 2015-08-17 11:55

(This article was first published on analytics for fun, and kindly contributed to R-bloggers)

In this post, I want to share some examples of data visualization I have been playing with recently. As on many other occasions, my field of application is digital analytics data, specifically data from Google Analytics.

You might remember a previous post where I built a tentative dashboard using R, Shiny and Google Charts. The final result was not too bad; however, the layout was somewhat too rigid, since I was using the “merge” command to combine the charts into the final dashboard.

So, I decided to spend some time improving my previous dashboard and adding a couple of new visualizations, which will hopefully be inspiring. Of course, I am still using R and Shiny, and in particular shinydashboard, a package purpose-built for creating dashboards with R.

The dashboard I’ve made makes use of the following visualizations:

  • Value boxes
  • Interactive Time Series (dygraphs)
  • Bubble charts
  • Streamgraphs
  • Treemaps 

You can see the final dashboard at shinyapps.io (though, because of the current limits of the basic plan, it might be temporarily unavailable), or better, you can check the code on GitHub. Here is a screenshot:

Let’s go quickly through each visualization to see which Google Analytics dimensions/metrics it shows.

Value Boxes

When you build a dashboard, boxes are probably the main building blocks, since they let you organize the information you want to show within the page. When I build a dashboard, I normally start by sketching the layout, which means placing the main boxes.

A particular type of box available in the shinydashboard package is the valueBox, which lets you display numeric or text values and also add an icon. Value boxes are great components to place at the top of a dashboard to display the main KPIs, percentage changes, or a short description of the rest of the dashboard.

In my dashboard I placed 3 value boxes at the top, showing the values for my 3 main KPIs: sessions to the website, transactions (conversions) and conversion rate. The code to build a value box with shinydashboard is very simple; if you want dynamic values, as in my case, you have to create corresponding elements in both the server.R and ui.R parts of your Shiny app:
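A minimal sketch of that pattern (not the original post’s code), assuming a hypothetical reactive ga_data() that returns the Google Analytics data with a sessions column:

# ui.R, inside dashboardBody(): reserve space for the dynamic box
valueBoxOutput("sessions_box")

# server.R: render the box from the assumed reactive dataset ga_data()
output$sessions_box <- renderValueBox({
  valueBox(value    = format(sum(ga_data()$sessions), big.mark = ","),
           subtitle = "Sessions",
           icon     = icon("users"),
           color    = "aqua")
})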

Interactive Time Series (dygraphs)

Time series charts can get chaotic and fail to provide clear insights when filled with too much data and too many series (you might end up with the so-called “spaghetti effect”).

But if time series are interactive, users can easily explore and make sense of complex datasets.

For example, users can highlight specific data points, include or exclude time series, zoom in on specific time intervals, enrich the graph with shaded regions or annotations, etc. All of these features are offered by the dygraphs JavaScript charting library.

I used the R dygraphs package (which provides an interface to the dygraphs JavaScript charting library) to make an interactive time series with my Google Analytics dataset. The simple chart I made shows 3 metrics over the period selected by the user: sessions, transactions and conversion rate (of those transactions). Both sessions and transactions use the left axis, while conversion rate uses the right one. I included a dyRangeSelector at the bottom of the chart that lets you narrow down the time interval.
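A minimal sketch of such a chart, assuming a hypothetical xts time series ga_xts with columns sessions, transactions and conversion_rate:

library(dygraphs)
dygraph(ga_xts, main = "Sessions, Transactions and Conversion Rate") %>%
  dySeries("sessions") %>%
  dySeries("transactions") %>%
  dySeries("conversion_rate", axis = "y2") %>%  # conversion rate on the right axis
  dyRangeSelector()                             # range selector at the bottom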

Bubble charts

With bubble charts you can show three dimensions of data. I used a bubble chart to visualize the performance of traffic channels: the x-axis represents the number of sessions, the y-axis the average pages per session, and transactions (the ultimate objective of many websites) are proportional to the size of the bubble. The larger the bubble, the higher the number of transactions produced by that traffic channel.

To make this chart I used the googleVis package.
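A sketch of the call, assuming a hypothetical data frame channels with one row per traffic channel and columns channel, sessions, pagesPerSession and transactions:

library(googleVis)
bubble <- gvisBubbleChart(channels,
                          idvar   = "channel",
                          xvar    = "sessions",
                          yvar    = "pagesPerSession",
                          sizevar = "transactions",  # bubble size ~ transactions
                          options = list(title = "Traffic channel performance"))
plot(bubble)  # opens the chart in the browser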

In the dashboard I’ve also included a one-dimensional bubble chart using the bubbles library. This type of chart works similarly to a bar chart, though the latter is more accurate for reading off the actual values you are showing.

On the other hand, this bubble chart might look more attractive than a bar chart, and it can display lots of values in a small area. I used this chart to show screen resolution data from the Google Analytics mobile reports.
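The bubbles widget lives on GitHub rather than CRAN; a hedged sketch, assuming a hypothetical data frame res with columns resolution and sessions:

# devtools::install_github("jcheng5/bubbles")
library(bubbles)
bubbles(value = res$sessions,     # bubble size driven by sessions
        label = res$resolution,   # screen resolution as the bubble label
        color = "steelblue")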

Streamgraphs 

Streamgraphs are a type of stacked area chart displaced around a central horizontal axis. Streamgraphs are very effective for visualizing data series that vary over time, especially if you need to show many categories.

The result is a flowing, organic shape, with strong aesthetic appeal, which is why streamgraphs are becoming more and more popular.

In the dashboard I made a streamgraph to visualize the evolution of sessions across devices (desktop, mobile, tablet) over the past years. To do it in R, I had to play a bit with the streamgraph package.
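A minimal sketch of such a streamgraph, assuming a hypothetical data frame devices_df with columns date, deviceCategory and sessions:

# devtools::install_github("hrbrmstr/streamgraph")
library(streamgraph)
devices_df %>%
  streamgraph(key = "deviceCategory", value = "sessions", date = "date") %>%
  sg_axis_x(tick_interval = 1, tick_units = "year") %>%
  sg_legend(TRUE, "Device: ")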

Below is the final data viz (I am not completely happy with this visualization because, for some reason, when I mouse over a series the value shown is always the total for the period, not the one for the specific date I am pointing at. Any help?).

Another interesting application to web analytics data would be using streamgraphs to analyse each channel’s share of traffic over time (direct vs organic vs paid vs referral, etc.).

Treemaps

Treemap visualizations are very effective at showing hierarchical (tree-structured) data in a compact way. They can display a lot of information within a limited space and at the same time allow users to drill down into the represented segments.

An example of hierarchical data in Google Analytics reports is device as the principal segment (main rectangles) and browser as the sub-segment (nested rectangles). The area of each rectangle is proportional to the number of sessions produced by the corresponding segment/sub-segment.

To make it in R, I used the treemap library (unfortunately the visualization is not interactive, but you can have a try with the d3treeR library).
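A sketch of the treemap call, assuming a hypothetical data frame dev_browser with columns deviceCategory, browser and sessions:

library(treemap)
treemap(dev_browser,
        index = c("deviceCategory", "browser"),  # hierarchy: device, then browser
        vSize = "sessions",                      # rectangle area ~ sessions
        title = "Sessions by device and browser")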

I hope you can get inspiration from these visualizations and include some of them in your digital analytics dashboards or reports. My plan is to keep adding more interesting visualizations (ones that are not currently offered in the Google Analytics reports) to this dashboard, to better show digital data. If you have suggestions, please leave a comment here or share them via the GitHub repo.

To leave a comment for the author, please follow the link and comment on his blog: analytics for fun.

Streamgraphs in R

Fri, 2015-07-31 13:51

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It’s not easy to visualize a quantity that varies over time and is composed of more than two subsegments. Take, for example, this stacked bar chart of religious affiliation of the Australian population over time:

While it’s easy to see how the share of Anglicans (at the bottom of the chart) has changed over time, it’s much more difficult to assess the change in the “No religion” category: the separated bars, coupled with the (necessarily) uneven positioning, make it hard to judge changes from year to year.

There’s no easy solution, but one increasingly popular idea is to use streamgraphs, which connect categories together into continuous polygons, and can even be aligned to the middle so that no one category gets favoured status (like the Anglican category above). Here’s a pioneering example of streamgraphs from the New York Times in 2008:

If you click the image above, you’ll see that one nice feature is that you can hover your mouse over a segment and see it highlighted, which makes it a little easier to observe changes over time within a segment.

You can easily make interactive streamgraphs like this in R with the streamgraph package, available on GitHub. The streamgraph function makes use of htmlwidgets and has a ggplot2-style object interface which makes it easy to create and customize your chart. There’s a nice demo on RPubs, from which I took this example:

stocks %>%
  mutate(date=as.Date(quarter, format="%m/%d/%y")) %>%
  streamgraph(key="ticker", value="nominal", offset="expand") %>%
  sg_fill_manual(stock_colors) %>%
  sg_axis_x(tick_interval=10, tick_units="year") %>%
  sg_legend(TRUE, "Ticker: ")

The resulting streamgraph is shown below. (Update: Thanks to Bob Rudis in the comments for the tip on embedding htmlwidgets into blog posts.)

To learn more about streamgraphs in R, check out the blog post linked below.

rud.is: Introducing the streamgraph htmlwidget R Package

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

15 Questions All R Users Have About Plots

Thu, 2015-07-30 11:40

(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

R allows you to create different plot types, ranging from the basic graph types like density plots, dot plots, bar charts, line charts, pie charts, boxplots and scatter plots, to the more statistically complex types of graphs such as probability plots, mosaic plots and correlograms.

In addition, R is well known for its data visualization capabilities: it allows you to go from producing basic graphs with little customization to plotting advanced graphs with full-blown customization, in combination with interactive graphics. Nevertheless, we don’t always get the results we want for our R plots: Stack Overflow is flooded with questions on plots, many of them recurring.

This is why DataCamp decided to put the frequently asked questions and their top-rated answers together in a blog post, complete with additional short explanations that could be of use to R beginners.

If you are rather interested in learning how to plot with R, you might consider reading our tutorial on histograms, which covers basic R, ggplot2 and ggvis, or this shorter tutorial which offers an overview of simple graphs in R. However, if you’re looking to learn everything on creating stunning and informative graphical visualizations, our interactive course on (interactive) data visualization with ggvis will definitely interest you!

1. How To Draw An Empty R Plot?

How To Open A New Plot Frame

You can open an empty plot frame and activate the graphics device in R as follows:

plot.new() # or frame()

Note that the plot.new() and frame() functions define a new plot frame without any axes, labels or outlining. They indicate that a new plot is to be made: a new graphics window will open if you don’t have one open yet; otherwise, the existing window is prepared to hold the new plot. You can read up on these functions here.

  • x11() opens a new window device in R for the X Window System (version 11)!
  • quartz() starts a graphics device driver for the OS X System.
  • windows() starts a new graphics device for Windows.
How To Set Up The Measurements Of The Graphics Window

You can also use the plot.window() function to set the horizontal and vertical dimensions of the empty plot you just made:

pWidth = 3
pHeight = 2
plot.window(c(0,pWidth), c(0,pHeight))

How To Draw An Actual Empty Plot

You can draw an empty plot with the plot() function:

plot(5, 5, type="n", axes=FALSE, ann=FALSE, xlim=c(0, 10), ylim = c(0,10))

You give the coordinates (5,5) for the plot, but you don’t show any plotting because you set the type argument to "n".

(Tip: try to put this argument to "p" or "b" and see what happens!)

What’s more, you don’t annotate the plot, but you do put limits from 0 to 10 on the x- and y-axes. Next, you can fill up your empty plot with axes, axis labels and a title with the following commands:

mtext("x-axis", side=1) #Add text to the x-axis mtext("y-axis", side=2) title("An R Plot") #Add a title

Note that if you want to know more about the side argument, you can just keep on reading! It will be discussed in more detail below, in question 3 about R plots.

Lastly, you may choose to draw a box around your plot by using the box() function and add some points to it with the points() function:

box() #Draw a box
points(5, 5, col="red") #Put a (red) point in the plot at (5,5)
points(5, 7, col="orange", pch=3, cex=2)
points(c(0, 0, 1), c(2, 4, 6), col="green", pch=4)

Note that you can put your x-coordinates and y-coordinates in vectors to plot multiple points at once. The pch argument allows you to select a symbol, while the cex argument has a value assigned to it that indicates how much the plotted text and symbols should be scaled with respect to the default.

Tip: if you want to see what number links to what symbol, click here.

2. How To Set The Axis Labels And Title Of The R Plots?

The axes of R plots are one of the most popular topics of Stack Overflow questions, and the questions related to this topic are very diverse. Keep on reading to find out which types of questions DataCamp has found to be quite common!

How To Name Axes (With Up- Or Subscripts) And Put A Title To An R Plot?

You can easily name the axes and put a title in place to make your R plot more specific and understandable for your audience.

This can be easily done by adding the arguments main for the main title, sub for the subtitle, xlab for the label of the x-axis and ylab for the label of the y-axis:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, main="main title", sub="sub-title", xlab="x-axis label", ylab="y-axis label")

Note that if you want to have a title with up-or subscripts, you can easily add these with the following commands:

plot(1, 1, main=expression("title"^2))  #Superscript
plot(1, 1, main=expression("title"[2])) #Subscript

This all combined gives you the following plot:

Good to know: for those who want to have Greek letters in your axis labels, the following code can be executed:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y,
     xlab = expression(paste("Greek letter ", phi)),
     ylab = expression(paste("Greek letter ", mu)))

How To Adjust The Appearance Of The Axes’ Labels

To adjust the appearance of the x- and y-axis labels, you can use the arguments col.lab and cex.lab. The first argument is used to change the color of the x- and y-axis labels, while the second determines the size of the labels relative to the (default) setting of cex.

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y,
     main=expression("main title"^2),
     sub=expression("sub-title"[2]),
     xlab="x-axis label", ylab="y-axis label",
     col.lab="blue", cex.lab=0.75)

For more information on these arguments, go to this page.

How To Remove A Plot’s Axis Labels And Annotations

If you want to get rid of the axis values of a plot, you can add the arguments xaxt and yaxt, set to "n". These arguments take a character that specifies the axis type; if you pass "n", as in the command below, the plotting of the axis is suppressed.

Note that by giving any other character to the xaxt and yaxt arguments, the x-and y-axes are plotted.

Next, you can add the annotation argument ann and set it to FALSE to make sure that any axis labels are removed.

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, xaxt="n", yaxt="n", ann=FALSE)

Tip: not the information you are looking for? Go to this page.

How To Rotate A Plot’s Axis Labels

You can add the las argument to the axis() function to rotate the numbers that correspond to each axis:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, axes=FALSE)
box()
axis(2, las=2)
axis(1, las=0)

Note how this actually requires you to add an argument to the plot() function that basically says there are no axes to be plotted (yet): this is the task of the two axis() functions that come next.

The las argument can have four values attributed to it. Depending on which option you choose, the placement of the label will differ: if you choose 0, the label will always be parallel to the axis (which is the default); if you choose 1, the label will be put horizontally. Pick 2 if you want it to be perpendicular to the axis and 3 if you want it to be placed vertically.

But there is more. If you want to know more about the possibilities of the axis() function, keep on reading!

How To Move The Axis Labels Of Your R Plot

So, you want to move your axes’ labels around?

No worries, you can do this with the axis() function; as you may have noticed before in this tutorial, this function allows you to specify where you want to draw the axes. Let’s say you want to draw the x-axis above the plot area and the y-axis to the right of it.

Remember that if you pass 1 or 2 to the axis() function, your axis will be drawn on the bottom and on the left of the plot area. Scroll a bit up to see an example of this in the previous piece of code!

This means that you will want to pass 3 and 4 to the axis() function:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y,
     axes=FALSE, # Do not plot any axes
     ann=FALSE)  # Do not plot any annotations
axis(3) # Draw the x-axis above the plot area
axis(4) # Draw the y-axis to the right of the plot area
box()

As you can see, at first, you basically plot x and y, but you leave out the axes and annotations. Then, you add the axes that you want to see and specify their location with respect to the plot.

The flexibility that the axis() function creates for you keeps on growing! Check out the next frequently asked question to see what else you can solve by using this basic R function.

Tip: go to the last question to see more on how to move around text in the axis labels with hjust and vjust!

3. How To Add And Change The Spacing Of The Tick Marks Of Your R Plot

How To Change The Spacing Of The Tick Marks Of Your R Plot

Letting R determine the tick marks of your plot can be quite annoying and there might come a time when you will want to adjust these.

1. Using The axis() Function To Determine The Tick Marks Of Your Plot

Consider the following piece of code:

v1 <- c(0, pi/2, pi, 3*pi/2, 2*pi)           # defines positions of tick marks
v2 <- c("0", "Pi/2", "Pi", "3*Pi/2", "2*Pi") # defines labels of tick marks
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, xaxt = "n")
axis(side = 1, at = v1, labels = v2, tck=-.05)

As you can see, you first define the position of the tick marks and their labels. Then, you draw your plot, specifying the x axis type as "n", suppressing the plotting of the axis.

Then, the real work starts:

  • The axis() function allows you to specify the side of the plot on which the axis is to be drawn. In this case, the argument is completed with 1, which means that the axis is to be drawn below. If the value was 2, the axis would be drawn on the left and if the value was 3 or 4, the axis would be drawn above or to the right, respectively;
  • The at argument allows you to indicate the points at which tick marks are to be drawn. In this case, you use the positions that were defined in v1;
  • Likewise, the labels that you want to use are the ones that were specified in v2;
  • You adjust the direction of the ticks through tck: by giving this argument a negative value, you specify that the ticks should appear below the axis.

Tip: try passing a positive value to the tck argument and see what happens!

You can further specify the size of the ticks through tcl and the appearance of the tick labels is controlled with cex.axis, col.axis and font.axis.

v1 <- c(0, pi/2, pi, 3*pi/2, 2*pi)           # defines positions of tick marks
v2 <- c("0", "Pi/2", "Pi", "3*Pi/2", "2*Pi") # defines labels of tick marks
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, xaxt = "n")
axis(side = 1, at = v1, labels = v2,
     tck=-.1, tcl = -0.5,
     cex.axis=1.05, col.axis="blue", font.axis=5)

2. Using Other Functions To Determine The Tick Marks Of Your R Plot

You can also use the par() and plot() functions to define the positions of tickmarks and the number of intervals between them.

Note that you then use the argument xaxp, to which you pass the positions of the tick marks located at the extremes of the x-axis, followed by the number of intervals between the tick marks:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, xaxp = c(0, 2*pi, 5))

# Or
x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, xaxt="n")
par(xaxp= c(0, 2*pi, 2))
axis(1)

Note that this only works if you use no logarithmic scale. If log coordinates are set to TRUE, or, in other words, if par(xlog=TRUE), the three values that you pass to xaxp have a different meaning: for a small range, n is negative. Otherwise, n is in 1:3, specifying a case number, and x1 and x2 are the lowest and highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2].

An example:

n <- 1
x <- seq(0, 10000, 1)
y <- exp(n)/(exp(n)+x)
par(xlog=TRUE, xaxp= c(1, 4, 3))
plot(x, y, log="x")


In this example, you use the par() function: you set xlog to TRUE and add the xaxp argument to give the coordinates of the extreme tick marks and the number of intervals between them. In this case, you set the minimal value to 1, the maximal value to 4 and you add that the number of intervals between each tick mark should be 3.

Then, you plot x and y, adding the log argument to specify whether to plot the x-axis, the y-axis or both on a log scale. You can pass "x", "y" and "xy" as values to the log argument to do this.

An example with both axes in logarithmic scale is:

n <- 1
x <- seq(0, 20, 1)
y <- exp(x)/(x)
par(xlog=TRUE, xaxp= c(1, 4, 3))
par(ylog=TRUE, yaxp= c(1, 11, 2))
plot(x, y, log="xy")

How To Add Minor Tick Marks To An R Plot

You can quickly add minor tick marks to your plot with the minor.tick() function from the Hmisc package:

plot.new()
library(Hmisc)
minor.tick(nx = 1.5, ny = 2, tick.ratio=0.75)

  • The nx argument allows you to specify the number of intervals in which you want to divide the area between the major tick marks on the axis. If you pass the value 1 to it, the minor tick marks will be suppressed;
  • ny allows you to do the same as nx, but then for the y-axis;
  • The tick.ratio indicates the ratio of lengths of minor tick marks to major tick marks. The length of the latter is retrieved from par(tck=x).
4. How To Create Two Different X- or Y-axes

The first option is to create a first plot and then execute the par() function with the new argument set to TRUE to prevent R from clearing the graphics device:

set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min=1000, max=10000)
plot(x, y)
par(new = TRUE)

Then, you create the second plot with plot():

plot(x, z,
     type = "l",   #Plot with lines
     axes = FALSE, #No axes
     bty = "n",    #Box about plot is suppressed
     xlab = "",    #No labels on x- and y-axis
     ylab = "")

Note that the axes argument has been set to FALSE, while you also leave the x- and y-axis labels blank.

You also add a new axis on the right-hand side by adding the argument side and assigning it the value 4.

Next, you specify the at argument to indicate the points at which tick marks need to be drawn. In this case, you compute a sequence of n+1 equally spaced “round” values covering the range of the values in z with the pretty() function. This ensures that you actually get the numbers of the y-axis of your second plot, which you named z.

Lastly, you add an axis label on the right-hand side:

axis(side=4, at = pretty(range(z)))
mtext("z", side=4, line=3)

Note that the side argument can have four values assigned to it: 1 to place the text at the bottom, 2 for a left placement, 3 for a top placement and 4 to put the text to the right. The line argument indicates on which margin line the text starts.

The end result will be like this:

Tip: Try constructing an R plot with two different x-axes! You can find the solution below:

plot.new()
set.seed(101)
x <- 1:10
y <- rnorm(10)
z <- runif(10, min=1000, max=10000)
par(mar = c(5, 4, 4, 4) + 0.3)
plot(x, y)
par(new = TRUE)
plot(z, y, type = "l", axes = FALSE, bty = "n", xlab = "", ylab = "")
axis(side=3, at = pretty(range(z)))
mtext("z", side=3, line=3)

Note that twoord.plot() of the plotrix package and doubleYScale() of the latticeExtra package automate this process:

library(latticeExtra)
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
Age <- chol$AGE
Chol <- chol$CHOL
Smoke <- chol$SMOKE
State <- chol$MORT
a <- xyplot(Chol ~ Age|State)
b <- xyplot(Smoke ~ Age|State)
doubleYScale(a, b, style1 = 0, style2 = 3, add.ylab2 = TRUE, columns=3)

Note that the example above is made with this dataset. If you’re not sure how you can import your data, check out our tutorial on importing data in R.

5. How To Add Or Change The R Plot’s Legend?

Adding And Changing An R Plot’s Legend With Basic R

You can easily add a legend to your R plot with the legend() function:

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright", inset=.05, cex = 1, title="Legend",
       c("Cosinus","Sinus"), horiz=TRUE,
       lty=c(1,1), lwd=c(2,2), col=c("red","blue"), bg="grey96")

Note that the arguments pt.cex and title.cex that are described in the documentation of legend() don’t really work. There are some workarounds:

1. Put the title or the labels of the legend in a different font with text.font:

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright", inset=.05, cex = 1, title="Legend",
       c("Cosinus","Sinus"), horiz=TRUE,
       lty=c(1,1), lwd=c(2,2), col=c("red","blue"), bg="grey96",
       text.font=3)

2. Draw the legend twice with different cex values

x <- seq(0,pi,0.1)
y1 <- cos(x)
y2 <- sin(x)
plot(c(0,3), c(0,3), type="n", xlab="x", ylab="y")
lines(x, y1, col="red", lwd=2)
lines(x, y2, col="blue", lwd=2)
legend("topright", inset=.05, c("Cosinus","Sinus"), title="",
       horiz=TRUE, lty=c(1,1), lwd=c(2,2), col=c("red","blue"))
legend(2.05, 2.97, inset=.05, c("",""), title="Legend", cex=1.15, bty="n")

Tip: if you’re interested in knowing more about the colors that you can use in R, check out this very helpful PDF document.

How To Add And Change An R Plot’s Legend And Labels In ggplot2

Adding a legend to your ggplot2 plot is fairly easy. You can just execute the following:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) +
  geom_point(aes(colour = factor(chol$MORT), shape=chol$SMOKE)) +
  xlab("Weight") + ylab("Height")

And it gives you a default legend. But, in most cases, you will want to adjust the appearance of the legend some more.

There are two ways of changing the legend title and labels in ggplot2:

1. If you have specified arguments such as colour or shape, or other aesthetics, you need to change the names and labels through scale_color_discrete and scale_shape_discrete, respectively:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) +
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) +
  xlab("Weight") + ylab("Height") +
  theme(legend.position=c(1,0.5), legend.justification=c(1,1)) +
  scale_color_discrete(name ="Condition", labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", labels=c("Non-smoker", "Sigare", "Pipe"))

Note that you get two legends because you map both the colour and the shape aesthetics inside geom_point() and give each its own name and labels!

If you want to move the legend to the bottom of the plot, you can specify the legend.position as "bottom". The legend.justification argument, on the other hand, allows you to position the legend inside the plot.
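For example, a minimal sketch (not code from the original post) that reuses the chol data above and moves the legend below the plot:

library(ggplot2)
ggplot(chol, aes(x = WEIGHT, y = HEIGHT)) +
  geom_point(aes(colour = factor(MORT), shape = SMOKE)) +
  xlab("Weight") + ylab("Height") +
  theme(legend.position = "bottom")  # legend below the plot area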

Tip: check out all kinds of scales that could be used to let ggplot know that other names and labels should be used here.

2. Change the data frame so that the factor has the desired form. For example:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
levels(chol$SMOKE)[levels(chol$SMOKE)=="Non-smoker"] <- "Non-smoker"
levels(chol$SMOKE)[levels(chol$SMOKE)=="Sigare"] <- "Sigare"
levels(chol$SMOKE)[levels(chol$SMOKE)=="Pipe"] <- "Pipe"
names(chol)[names(chol)=="SMOKE"] <- "Smoker"

You can then use the new factor names to make your plot in ggplot2, avoiding the “hassle” of changing the names and labels with extra lines of code in your plotting.
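For instance, after the recoding above, the shape legend is titled "Smoker" without any scale_* calls (a sketch building on the chol data frame from the previous block):

library(ggplot2)
ggplot(chol, aes(x = WEIGHT, y = HEIGHT)) +
  geom_point(aes(colour = factor(MORT), shape = Smoker)) +
  xlab("Weight") + ylab("Height")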

Tip: for a complete cheat sheet on ggplot2, you can go here.

6. How To Draw A Grid In Your R Plot?

Drawing A Grid In Your R Plot With Basic R

For some purposes, you might find it necessary to include a grid in your plot. You can easily add a grid to your plot by using the grid() function:

x <- c(1,2,3,4,5)
y <- 2*x
plot(x, y)
grid(10,10)

Drawing A Grid In An R Plot With ggplot2

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) +
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) +
  xlab("Weight") + ylab("Height") +
  scale_color_discrete(name ="Condition", labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", labels=c("Non-smoker", "Sigare", "Pipe")) +
  theme(legend.position=c(1,0.5), legend.justification=c(1,1),
        panel.grid.major = element_line(colour = "grey40"),
        panel.grid.minor = element_line(colour = "grey40"))

Tip: if you don’t want to have the minor grid lines, just pass element_blank() to panel.grid.minor. If you want to fill the background up with a color, add the panel.background = element_rect(fill = "navy") to your code, just like this:

library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) +
  geom_point(aes(colour = factor(chol$MORT), shape = chol$SMOKE)) +
  xlab("Weight") + ylab("Height") +
  scale_color_discrete(name ="Condition", labels=c("Alive", "Dead")) +
  scale_shape_discrete(name="Smoker", labels=c("Non-smoker", "Sigare", "Pipe")) +
  theme(legend.position=c(1,0.5), legend.justification=c(1,1),
        panel.grid.major = element_line(colour = "grey40"),
        panel.grid.minor = element_line(colour = "grey40"),
        panel.background = element_rect(fill = "navy"))

7. How To Draw A Plot With A PNG As Background?

You can quickly draw a plot with a .png file as background with the help of the png package. Install the package if you need to, load it into your workspace with library(), and you can start plotting!

install.packages("png")
library(png)

First, you want to load in the image. Use the readPNG() function to specify the path to the picture!

image <- readPNG("<path to your picture>")

Tip: you can check where your working directory is set at and change it by executing the following commands:

getwd()
setwd("<path to a folder>")

If your picture is saved in your working directory, you can just specify readPNG("picture.png") instead of passing the whole path.

Next, you want to set up the plot area:

plot(1:2, type='n', main="Plotting Over an Image", xlab="x", ylab="y")

And you want to call the par() function:

lim <- par()

You can use the par() function to retrieve the graphical parameters of the current plot and pass them to rasterImage(). Its usr component holds the extremes of the user coordinates of the plotting region; elements 1, 3, 2 and 4 give the left, bottom, right and top coordinates for the image:

rasterImage(image, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])

Next, you draw a grid and add some lines:

grid()
lines(c(1, 1.2, 1.4, 1.6, 1.8, 2.0), c(1, 1.3, 1.7, 1.6, 1.7, 1.0),
      type="b", lwd=5, col="red")

This can give you the following result if you use the DataCamp logo:

library(png)
image <- readPNG("datacamp.png")
plot(1:2, type="n", main="Plotting Over an Image", xlab="x", ylab="y", asp=1)
lim <- par()
rasterImage(image, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])
lines(c(1, 1.2, 1.4, 1.6, 1.8, 2.0), c(1.5, 1.3, 1.7, 1.6, 1.7, 1.0),
      type="b", lwd=5, col="red")

Note that you need to give a .png file as input to readPNG()!

8. How To Adjust The Size Of Points In An R Plot?

Adjusting The Size Of Points In An R Plot With Basic R

To adjust the size of the points with basic R, you might just simply use the cex argument:

x <- c(1,2,3,4,5)
y <- c(6,7,8,9,10)
plot(x, y, cex=2, col="red")

Remember, however, that R allows you to have much more control over your symbols through the symbols() function:

df <- data.frame(x1=1:10, x2=sample(10:99, 10), x3=10:1)
symbols(x=df$x1, y=df$x2, circles=df$x3,
        inches=1/3, ann=F, bg="steelblue2", fg=NULL)

The circles of this plot receive the values of df$x3 as their radii, while the inches argument controls the size of the symbols: when it receives a positive number, the symbols are scaled so that the largest dimension is this size in inches.

Adjusting The Size Of Points In Your R Plot With ggplot2

In this case, you will want to adjust the size of the points in your scatterplot. You can do this with the size argument:

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(ggplot2)
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT), size = 2) + geom_point()

# or
ggplot(chol, aes(x=chol$WEIGHT, y=chol$HEIGHT)) + geom_point(size = 2)

9. How To Fit A Smooth Curve To Your R Data

The loess() function is probably every R programmer’s favorite solution for this kind of question. It actually “fits a polynomial surface determined by one or more numerical predictors, using local fitting”.

In short, you have your data:

x <- 1:10
y <- c(2,4,6,8,7,12,14,16,18,20)

And you use the loess() function, in which you model y as a function of x. Through this, you specify the numeric response and one to four numeric predictors:

lo <- loess(y~x) ### estimations between data

You plot x and y:

plot(x,y)

And you add lines to the original plot at the values predicted by lo:

lines(predict(lo))

Which gives you the following plot:

10. How To Add Error Bars In An R Plot

Drawing Error Bars With Basic R

The bad news: R can’t draw error bars just like that. The good news: you can still draw the error bars without needing to install extra packages!

#Load the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

#Plot the data
plot(cholc$N, cholc$CHOL,
     ylim=range(c(cholc$CHOL-cholc$sd, cholc$CHOL+cholc$sd)),
     pch=19,
     xlab="Cholesterol Measurements", ylab="Cholesterol Mean +/- SD",
     main="Scatterplot With sd Error Bars")

#Draw arrows of a "special" type
arrows(cholc$N, cholc$CHOL-cholc$sd, cholc$N, cholc$CHOL+cholc$sd,
       length=0.05, angle=90, code=3)

If you want to read up on all the arguments that arrows() can take, go here.

Drawing Error Bars With ggplot2

Error Bars Representing Standard Error Of Mean

First summarize your data with the summarySE() function from the Rmisc package:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

Then, you can use the resulting data frame to plot some of the variables, drawing error bars for them at the same time, with, for example, the standard error of the mean:
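A minimal sketch of such a plot, using the cholc summary from above (se is the standard-error column that summarySE() produces):

library(ggplot2)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) +
  geom_errorbar(aes(ymin=CHOL-se, ymax=CHOL+se, group=MORT), width=.1) +
  geom_line(aes(group=MORT)) +
  geom_point()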

If you want to change the position of the error bars, for example, when they overlap, you might consider using the position_dodge() function:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) +
  geom_errorbar(aes(ymin=CHOL-se, ymax=CHOL+se, group=MORT), width=.1, position=pd) +
  geom_line(aes(group=MORT)) +
  geom_point()


Tip: if you get an error like “geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?”, it usually means that you need to map a group aesthetic explicitly, as is done with group=MORT above.

Error Bars Representing Confidence Intervals

Continuing from the summary of your data that you made with the summarySE() function, you can also draw error bars that represent confidence intervals. In this case, error bars for 95% confidence intervals are plotted.

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) +
  geom_errorbar(aes(ymin=CHOL-ci, ymax=CHOL+ci, group=MORT), width=.1, colour="black", position=pd) +
  geom_line(aes(group=MORT)) +
  geom_point()


Note how the color of the error bars is now set to black with the colour argument.

Error Bars Representing The Standard Deviation

Lastly, you can also use the results of the summarySE() function to plot error bars that represent the standard deviation. Specifically, you would just have to adjust the ymin and ymax arguments that you pass to geom_errorbar():

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) +
  geom_errorbar(aes(ymin=CHOL-sd, ymax=CHOL+sd, group=MORT), width=.1, position=pd) +
  geom_line(aes(group=MORT)) +
  geom_point()

Big tip: also take a look at this for more detailed examples on how to plot means and error bars.

11. How To Save A Plot As An Image On Disc

You can use dev.copy() to copy your graph from the current graphics device to the device or file that you specify.

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y)
dev.copy(jpeg, filename="<path to your file/name.jpg>")
dev.off()

12. How To Plot Two R Plots Next To Each Other?

How To Plot Two Plots Side By Side Using Basic R

You can do this with basic R commands:

d0 <- matrix(rnorm(15), ncol=3)
d1 <- matrix(rnorm(15), ncol=3)
limits <- range(d0, d1) #Set limits
par(mfrow = c(1, 2))
boxplot(d0, ylim=limits)
boxplot(d1, ylim=limits)

By adding the par() function with the mfrow argument, you specify a vector, which in this case contains 1 and 2: all figures will then be drawn in a 1-by-2 array on the device, filled by rows (mfrow). In other words, the boxplots above will be printed in one row across two columns.

How To Plot Two Plots Next To Each Other Using ggplot2

If you want to put plots side by side and if you don’t want to specify limits, you can consider using the ggplot2 package to draw your plots side-by-side:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Calculate some statistics for the chol dataset
library(Rmisc)
cholc <- summarySE(chol, measurevar="CHOL", groupvars=c("MORT","SMOKE"))

#Plot the cholc dataset
library(ggplot2)
pd <- position_dodge(0.1)  # defined here so the block runs on its own
ggplot(cholc, aes(x=SMOKE, y=CHOL, colour=MORT)) +
  geom_errorbar(aes(ymin=CHOL-sd, ymax=CHOL+sd, group=MORT), width=.1, position=pd) +
  geom_line(aes(group=MORT)) +
  geom_point() +
  facet_grid(. ~ MORT)


Note how you just add the facet_grid() function to indicate that you want the two plots next to each other. The element that is used to determine how the plots are drawn is MORT, as you can see above!

How To Plot More Plots Side By Side Using gridExtra

To get plots printed side by side, you can use the gridExtra package; make sure you have the package installed and loaded in your workspace and then execute something like this:

library(gridExtra)
plot1 <- qplot(1)
plot2 <- qplot(1)
grid.arrange(plot1, plot2, ncol=2)

Note how here again you determine how the two plots will appear to you thanks to the ncol argument.

How To Plot More Plots Side By Side Using lattice

Just like the solution with ggplot2 package, the lattice package also doesn’t require you to specify limits or the way you want your plots printed next to each other.

Instead, you use bwplot() to make trellis graphs with the graph type of a box plot. Trellis graphs display a variable or the relationship between variables, conditioned on one or more other variables.

In this case, if you’re using the chol data set (which you can find here or load in with the read.table() function given below), you display the variable CHOL separately for every combination of factor SMOKE and MORT levels:

#Load in the data
chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)

#Plot two plots side by side
library(lattice)
bwplot(~ CHOL|SMOKE+MORT, chol)

Plotting Plots Next To Each Other With gridBase

Another way even to put two plots next to each other is by using the gridBase package, which takes care of the “integration of base and grid graphics”. This could be handy when you want to put a basic R plot and a ggplot next to each other.

You work as follows: first, you activate the necessary packages in your workspace. In this case, you want to have gridBase ready to put the two plots next to each other and grid and ggplot2 to actually make your plots:

library(grid)
library(gridBase)
library(ggplot2)
plot.new()
gl <- grid.layout(nrow=1, ncol=2)
vp.1 <- viewport(layout.pos.col=1, layout.pos.row=1)
vp.2 <- viewport(layout.pos.col=2, layout.pos.row=1)
pushViewport(viewport(layout=gl))
pushViewport(vp.1)
par(new=TRUE, fig=gridFIG())
plot(x = 1:10, y = 10:1)
popViewport()
pushViewport(vp.2)
ggplotted <- qplot(x=1:10, y=10:1, 'point')
print(ggplotted, newpage = FALSE)
popViewport(1)


If you want and need it, you can start an empty plot:

plot.new()

To then set up the layout:

gl <- grid.layout(nrow=1, ncol=2)

Note that since you want the two plots to be generated next to each other, this requires you to make a grid layout consisting of one row and two columns.

Now, you want to fill up the cells of the grid with viewports. These define rectangular regions on your graphics device with the help of coordinates within those regions. In this case, it’s much more handy to use the specifications of the grid that have just been described above rather than real x-or y-coordinates. That is why you should use the layout.pos.col and layout.pos.row arguments:

vp.1 <- viewport(layout.pos.col=1, layout.pos.row=1)
vp.2 <- viewport(layout.pos.col=2, layout.pos.row=1)

Note again that since you want the two plots to be generated next to each other, you want to put one plot in the first column and the other in the second column, both located on the first row.

Since the viewports are only descriptions or definitions, these kinds of objects need to be pushed onto the viewport tree before you can see any effect on the drawing. You want to use the pushViewport() function to accomplish this:

pushViewport(viewport(layout=gl))

Note the pushViewport() function takes the viewport() function, which in itself contains a layout argument. This last argument indicates “a grid layout object which splits the viewport into subregions”.

Remember that you started out making one of those objects.

Now you can proceed to adding the first rectangular region vp.1 to the ViewPort tree:

pushViewport(vp.1)

After that you tell R with gridFIG() to draw a base plot within a grid viewport (vp.1, that is). The fig argument normally takes the coordinates of the figure region in the display region of the device. In this case, you use the fig argument to start a new plot, adding it to the existing plot by also setting new = TRUE in the par() function. You then plot the base graphic and remove the viewport from the tree:

par(new=TRUE, fig=gridFIG())
plot(x = 1:10, y = 10:1)
popViewport()

Note that you can specify in the popViewport() function an argument to indicate how many viewports you want to remove from the tree. If this value is 0, this indicates that you want to remove the viewports right up to the root viewport. The default value of this argument is 1.

Go on to add the second rectangular region vp.2 to the viewport tree. You can then make the ggplot and remove the viewport from the tree:

pushViewport(vp.2)
ggplotted <- qplot(x=1:10, y=10:1, 'point')
print(ggplotted, newpage = FALSE)
popViewport(1)

Note that you need to call print() on the graphics object made by qplot() in order to actually draw it and get it displayed. At the same time, you also want to specify newpage = FALSE, otherwise a new page is started and the base plot you already drew is wiped.

Also remember that the default value of viewports to remove in the function popViewport() is set at 1. This makes it kind of redundant to put popViewport(1) in the code.

13. How To Plot Multiple Lines Or Points?

Using Basic R To Plot Multiple Lines Or Points In The Same R Plot

To plot two or more graphs in the same plot, you basically start by making a usual basic plot in R. An example of this could be:

x <- seq(0,pi,0.1)
y1 <- cos(x)
plot(x, y1, type="l", col = "red")

Then, you start adding more lines or points to the plot. In this case, you add more lines to the plot, so you’ll define more y variables:

y2 <- sin(x)
y3 <- tan(x)
y4 <- log(x)

Then, you plot these y variables with the use of the lines() function:

lines(x,y2,col="green") lines(x,y2,col="green") lines(x,y3,col="black") lines(x,y4,col="blue")

This gives the following result:

Note that the lines() function takes three arguments here: the x and y coordinates of the points to join and the color (represented by the argument col) in which you want to plot them. You can also include the following features:

Feature            Argument   Input
Line type          lty        Integer or character string
Line width         lwd        Integer
Plot type          pch        Integer or single character
Line end style     lend       Integer or string
Line join style    ljoin      Integer or string
Line mitre limit   lmitre     Number larger than 1

Here are some examples:

lines(x,y2,col="green", lty = 2, lwd = 3) lines(x,y2,col="green", lty = 5, lwd = 2, pch = 2) lines(x,y3,col="black", lty = 3, lwd = 5, pch = 3, lend = 0, ljoin = 2) lines(x,y4,col="blue", lty = 1, lwd = 2, pch = 3, lend = 2, ljoin = 1, lmitre = 2)

Note that the pch argument does not function all that well with the lines() function and that it’s best to use it only with points().

Tip: if you want to plot points in the same graph, you can use the points() function:

y5 <- x^3
points(x, y5, col="yellow")

You can add the same arguments to the points() function as you did with the lines() function and that are listed above. There are some additions, though:

Feature                           Argument   Input
Background (fill) color           bg         Only if pch = 21:25
Character (or symbol) expansion   cex        Integer

Code examples of these arguments are the following:

points(x,y4,col="blue", pch=21, bg = "red") points(x, y5, col="yellow", pch = 5, bg = "blue")

If you incorporate these changes into the plot that you see above, you will get the following result:

x <- seq(0,pi,0.1)
y1 <- cos(x)
plot(x, y1, type="l", col = "red") #basic graphical object
y2 <- sin(x)
y3 <- tan(x)
y4 <- log(x)
y5 <- x^3
lines(x,y2,col="green", lty = 1, lwd = 3)                                 #first layer
lines(x,y2,col="green", lty = 3, lwd = 2, pch = 2)                        #second layer
lines(x,y3,col="black", lty = 2, lwd = 1, pch = 3, lend = 0, ljoin = 2)   #third layer
points(x,y4,col="blue", pch=21, bg = "red")                               #fourth layer
points(x, y5, col="yellow", pch = 24, bg = "blue")                        #fifth layer

Using ggplot2 To Plot Multiple Lines Or Points In One R Plot

The ggplot2 package conveniently allows you to create layers, which lets you plot two or more graphs in the same R plot without any difficulty:

library(ggplot2)
x <- 1:10
y1 <- c(2,4,6,8,7,12,14,16,18,20)
y2 <- rnorm(10, mean = 5)
df <- data.frame(x, y1, y2)
ggplot(df, aes(x)) +                      # basic graphical object
  geom_line(aes(y=y1), colour="red") +    # first layer
  geom_line(aes(y=y2), colour="green")    # second layer

14. How To Fix The Aspect Ratio For Your R Plots

If you want your R plot to be saved as an image in which the axes are proportional to their scales, you will want to fix the aspect ratio.

Adjusting The Aspect Ratio With Basic R

When you’re working with basic R commands to produce your plots, you can set the aspect ratio by adding the asp argument of the plot() function, completed with a numeric value. Look at this first example without a defined aspect ratio:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y)

And compare this now to the plot where the aspect ratio is defined with the argument asp:

x <- seq(0,2*pi,0.1)
y <- sin(x)
plot(x, y, asp=2)

Adjusting The Aspect Ratio For Your Plots With ggplot2

To fix the aspect ratio for ggplot2 plots, you just add the function coord_fixed(), which provides a “fixed scale coordinate system [that] forces a specified ratio between the physical representation of data units on the axes”.

In other words, this function allows you to specify a number of units on the y-axis which is equivalent to one unit on the x-axis. The default is always set at 1, which means that one unit on the x-axis has the same length as one unit on the y-axis. If your ratio is set at a higher value, the units on the y-axis are longer than units on the x-axis and vice versa.

Compare the following examples:

library(ggplot2)
df <- data.frame(x = runif(100, 0, 5), y = runif(100, 0, 5))
ggplot(df, aes(x=x, y=y)) + geom_point()

versus

library(ggplot2)
df <- data.frame(x = runif(100, 0, 5), y = runif(100, 0, 5))
ggplot(df, aes(x=x, y=y)) + geom_point() + coord_fixed(ratio=1)
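To see the effect of a ratio other than 1, here is a quick sketch reusing the df above: with ratio = 2, one unit on the y-axis is drawn twice as long as one unit on the x-axis.

ggplot(df, aes(x=x, y=y)) + geom_point() + coord_fixed(ratio=2)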

Adjusting The Aspect Ratio For Your Plots With MASS

You can also consider using the MASS package, which includes the eqscplot() function: it produces scatterplots with geometrically equal scales.

chol <- read.table(url("http://assets.datacamp.com/blog_assets/chol.txt"), header = TRUE)
library(MASS)
x = chol$HEIGHT
y = chol$WEIGHT
z = as.numeric(chol$MORT)
eqscplot(x, y, ratio = 1, col=c("red", "green"), pch=c(1,2))

Tip: you might do well starting a new plot frame before executing the code above!

Note that you can give additional arguments to the eqscplot() function to customize the scatterplot’s look!

15. What Is The Function Of hjust And vjust In ggplot2?

Well, you basically use these arguments when you want to set the position of text in your ggplot. hjust allows you to define the horizontal justification, while vjust is meant to control the vertical justification. See the documentation on geom_text() for more information.

To demonstrate what exactly happens, you can create a data frame from all combinations of factors with the expand.grid() function:

hjustvjust <- expand.grid(hjust=c(0, 0.5, 1), vjust=c(0, 0.5, 1), angle=c(0, 45, 90), text="Text" )

Note that hjust and vjust can only take values between 0 and 1.

  • 0 means that the text is left-justified; In other words, all text is aligned to the left margin. This is usually what you see when working with text editors such as Word.
  • 1 means that the text is right-justified: all text is aligned to the right margin.

Then, you can plot the data frame that you have just made above with the ggplot() function, defining the x-and y-axis as “hjust” and “vjust” respectively:

library(ggplot2)
ggplot(hjustvjust, aes(x=hjust, y=vjust)) +
  geom_point() +
  geom_text(aes(label=text, angle=angle, hjust=hjust, vjust=vjust)) +
  facet_grid(~angle) +
  scale_x_continuous(breaks=c(0, 0.5, 1), expand=c(0, 0.2)) +
  scale_y_continuous(breaks=c(0, 0.5, 1), expand=c(0, 0.2))

Also note how the hjust and vjust arguments are added to geom_text(), which takes care of the textual annotations to the plot.

In the plot above you see that the text at the point (0,0) is left-aligned, horizontally as well as vertically. On the other hand, the text at point (1,1) is right-aligned in horizontal as well as vertical direction. The point (0.5,0.5) is right in the middle: it’s not really left-aligned nor right-aligned for what concerns the horizontal and vertical directions.

Note that when these arguments are defined to change the axis text, the horizontal alignment for axis text is defined in relation to the entire plot, not to the x-axis!

DF <- data.frame(x=LETTERS[1:3], y=1:3)
p <- ggplot(DF, aes(x,y)) +
  geom_point() +
  ylab("Very long label for y") +
  theme(axis.title.y=element_text(angle=0))

p1 <- p + theme(axis.title.x=element_text(hjust=0))   + xlab("X-axis at hjust=0")
p2 <- p + theme(axis.title.x=element_text(hjust=0.5)) + xlab("X-axis at hjust=0.5")
p3 <- p + theme(axis.title.x=element_text(hjust=1))   + xlab("X-axis at hjust=1")

library(gridExtra)
grid.arrange(p1, p2, p3)

Also try for yourself what defining the vjust argument to change the axis text does to the representation of your plot:

DF <- data.frame(x=c("ana","b","cdefghijk","l"), y=1:4)
p <- ggplot(DF, aes(x,y)) + geom_point()

p1 <- p + theme(axis.text.x=element_text(vjust=0, colour="red")) +
  xlab("X-axis labels aligned with vjust=0")
p2 <- p + theme(axis.text.x=element_text(vjust=0.5, colour="red")) +
  xlab("X-axis labels aligned with vjust=0.5")
p3 <- p + theme(axis.text.x=element_text(vjust=1, colour="red")) +
  xlab("X-axis labels aligned with vjust=1")

library(gridExtra)
grid.arrange(p1, p2, p3)

To go to the original excellent discussion, from which the code above was adapted, click here.

As A Last Note…

It’s really worth checking out this article, which lists 10 tips for making your R graphics look their best!

Also, if you want to know more about data visualization, you might consider checking out DataCamp’s interactive course on data visualization with ggvis, given by Garrett Grolemund, author of Hands on Programming with R, as well as Data Science with R.

Or maybe our course on reporting with R Markdown can interest you!

The post 15 Questions All R Users Have About Plots appeared first on The DataCamp Blog.

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.

MRAN’s Packages Spotlight

Thu, 2015-07-30 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

New R packages just keep coming. The following plot, constructed with information from the monthly files on Dirk Eddelbuettel's CRANberries site, shows the number of new packages released to CRAN between January 1, 2013 and July 27, 2015, by month (not quite 31 months).

This is amazing growth! The mean rate is about 125 new packages a month. How can anyone keep up? The direct approach, of course, would be to become an avid, frequent reader of CRANberries. Every day the CRAN:New link presents the relentless roll call of new arrivals. However, dealing with this extreme level of tediousness is not for everyone.

At MRAN we are attempting to provide some help with the problem of keeping up with what's new through the old fashioned (pre-machine learning) practice of making some idiosyncratic, but not completely capricious, human generated recommendations. With every new release of RRO we publish on the Package Spotlight page brief descriptions of packages in three categories: New Packages, Updated Packages and GitHub packages. None of these lists are intended to be either comprehensive or complete in any sense.

The New Packages list includes new packages that have been released to CRAN since the previous release of RRO. My general rules for selecting packages for this list are: (1) they should be tools or infrastructure packages that may prove useful to a wide audience, or (2) they should involve a new algorithm or statistical technique that I think will be of interest to statisticians and data scientists working in many different areas. The following two packages respectively illustrate these two selection rules:

metricsgraphics V0.8.5: provides an htmlwidgets interface to the MetricsGraphics.js D3 JavaScript library for plotting time series data. The vignette shows what it can do.

rotationForest V0.1: provides an implementation of the new Rotation Forest binary ensemble classifier described in the paper by Rodriguez et al.

I also tend to favor packages that are backed by a vignette, paper or url that provides additional explanatory material.

Of course, any scheme like this is limited by the knowledge and biases of the curator. I am particularly worried about missing packages targeted towards biotech applications that may indeed have broader appeal. The way to mitigate the shortcomings of this approach is to involve more people. So if you come across a new package that you think may have broad appeal send us a note and let us know why (open@revolutionanalytics.com).

The Updated Packages list is constructed with the single criterion that the fact that the package was updated should convey news of some sort. Most of the very popular and useful packages are updated frequently, some approaching monthly updates. So, even though they are important packages, the fact that they have been updated is generally no news at all. It is also the case that package authors generally do not put much effort into describing the updates. In my experience poking around CRAN I have found that the NEWS directories for packages go mostly unused. (An exemplary exception is the NEWS for ggplot2.)

Finally, the GitHub list is mostly built from repositories that are trending on GitHub with a few serendipitous finds included.

We would be very interested in learning how you keep up with new R packages. Please leave us a comment.

Post Script:

The code for generating the plot may be found here: Download New_packages

Also, we have written quite a few posts over the last year or so about the difficulties of searching for relevant packages on CRAN. Here are links to three recent posts:

How many packages are there really on CRAN?
Fishing for packages in CRAN
Working with R Studio CRAN Logs

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

The little mixed model that could, but shouldn’t be used to score surgical performance

Thu, 2015-07-30 03:33

(This article was first published on Statistical Reflections of a Medical Doctor » R, and kindly contributed to R-bloggers)

The Surgeon Scorecard

Two weeks ago, the world of medical journalism was rocked by the public release of ProPublica’s Surgeon Scorecard. In this project ProPublica “calculated death and complication rates for surgeons performing one of eight elective procedures in Medicare, carefully adjusting for differences in patient health, age and hospital quality.”  By making the dataset available through a user friendly portal, the intended use of this public resource was to “use this database to know more about a surgeon before your operation“.

The Scorecard was met with great enthusiasm and coverage by non-medical media. A TODAY.com headline "nutshelled" the Scorecard as a resource that "aims to help you find doctors with lowest complication rates". A (tongue-in-cheek?) NBC headline says of the Scorecard: "It's complicated". On the other hand, the project was not well received by my medical colleagues. John Mandrola gave it a failing grade in Medscape. Writing at KevinMD.com, Jeffrey Parks called it a journalistic low point for ProPublica. Jha Shaurabh pointed out a potential paradox in a statistically informed, easy to read and highly entertaining piece. In this paradox, the surgeon with the higher complication rate who takes high-risk patients from a disadvantaged socio-economic environment may actually be the surgeon one wants to perform one's surgery! Ed Schloss summarized the criticism (in the blogosphere and twitter) in an open letter and asked for peer review of the Scorecard methodology.

The criticism to date has largely focused on the potential for selection effects (as the Scorecard is based on Medicare data, and does not include data from private insurers), the incomplete adjustment for confounders, the paucity of data for individual surgeons, the counting of complications and re-admission rates, decisions about risk category classification boundaries and even data errors (ProPublica’s response arguing that the Scorecard matters may be found here). With a few exceptions (e.g. see Schloss’s blogpost in which the complexity of the statistical approach is mentioned) the criticism of the statistical approach (including my own comments in twitter) has largely focused on these issues.

On the other hand, the underlying statistical methodology (here and there) that powers the Scorecard has not received much attention. Therefore I undertook a series of simulation experiments to explore the impact of the statistical procedures on the inferences afforded by the Scorecard.

The mixed model that could – a short non-technical summary of ProPublica’s approach

ProPublica’s approach to the Scorecard is based on a logistic regression model, in which individual surgeon (and hospital) performance (probability of suffering a complication) is modelled using Gaussian random effects, while patient-level characteristics that may act as confounders are adjusted for using fixed effects. In a nutshell, this approach implies fitting a model of the average complication rate that is a function of the fixed effects (e.g. patient age) for the entire universe of surgeries performed in the USA. Individual surgeon and hospital factors modify this complication rate, so that a given surgeon and hospital will have an individual rate that varies around the population average. These individual surgeon and hospital factors are constrained to follow a Gaussian, bell-shaped distribution when analyzing complication data. After model fitting, these predicted random effects are used to quantify and compare surgical performance. A feature of mixed modeling approaches is the unavoidable shrinkage of the raw complication rate towards the population mean. Shrinkage implies that the dynamic range of the actually observed complication rates is compressed. This is readily appreciated in the figures generated by the ProPublica analytical team:
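
Before turning to those figures, it may help to make the structure of such a model concrete. Below is a minimal lme4 sketch of this kind of model; this is not ProPublica's actual code, and the data frame surgeries with columns complication, age, surgeon_id and hospital_id is hypothetical:

library(lme4)

## hypothetical data: one row per surgery, with a 0/1 complication indicator,
## a patient-level covariate (age) and identifiers for surgeon and hospital
fit <- glmer(complication ~ age + (1 | surgeon_id) + (1 | hospital_id),
             data = surgeries, family = binomial)

## predicted (shrunken) surgeon-level random effects, on the log-odds scale
surgeon_re <- ranef(fit)$surgeon_id[, 1]

## implied complication probability for each surgeon, at the baseline covariate
## value and an average hospital
adj_rate <- plogis(fixef(fit)["(Intercept)"] + surgeon_re)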

In their methodological white paper the ProPublica team notes:

“While raw rates ranged from 0.0 to 29%, the Adjusted Complication Rate goes from 1.1 to 5.7%. …. shrinkage is largely a result of modeling in the first place, not due to adjusting for case mix. This shrinkage is another piece of the measured approach we are taking: we are taking care not to unfairly characterize surgeons and hospitals.”

These features should alert us that something is going on. For if a model can distort the data to such a large extent, then the model should be closely scrutinized before being accepted. In fact, given these observations,  it is possible that one mistakes the noise from the model for the information hidden in the empirical data. Or, even more likely, that one is not using the model in the most productive manner.

Note that these comments  should not be interpreted as a criticism against the use of mixed models in general, or even for the particular aspects of the Scorecard project. They are rather a call for re-examining the modeling assumptions and for gaining a better understanding of the model “mechanics of prediction” before releasing the Kraken to the world.

The little mixed model that shouldn’t

There are many technical aspects of a Generalized Linear Mixed Model for complication rates that one could potentially get wrong. Getting the wrong shape of the random effects distribution is of concern (e.g. assuming it is bell shaped when it is not). Getting the underlying model wrong, e.g. assuming the binomial model for complication rates while a model with many more zeros (a zero-inflated model) may be more appropriate, is yet another potential problem area. However, even if these factors are not operational, one may still be misled when using the results of the model. In particular, the major area of concern for such models is the cluster size: the number of observations per individual random effect (e.g. surgeon) in the dataset. It is this factor, rather than the actual size of the dataset, that determines the precision of the individual random effects. Using a toy example, we show that the number of observations per surgeon typical of the Scorecard dataset leads to predicted random effects that may be far from their true value. This seems to stem from the non-linear nature of the logistic regression model. As we conclude in our first technical post:

  • Random Effect modeling of binomial outcomes requires a substantial number of observations per individual (in the order of thousands) for the procedure to yield estimates of individual effects that are numerically indistinguishable from the true values.

 

Contrast this conclusion to the cluster size in the actual scorecard:

Procedure Code   N (procedures)   N (surgeons)   Procedures per surgeon
51.23            201,351          21,479         9.37
60.5             78,763           5,093          15.46
60.29            73,752           7,898          9.34
81.02            52,972           5,624          9.42
81.07            106,689          6,214          17.17
81.08            102,716          6,136          16.74
81.51            494,576          13,414         36.87
81.54            1,190,631        18,029         66.04
Total            2,301,450        83,887         27.44

In a follow-up simulation study we demonstrate that this feature results in predicted individual effects that are non-uniformly shrunk towards their average value. This compromises the ability of mixed model predictions to separate the good from the bad “apples”.

In the second technical post, we undertook a simulation study to understand the implications of over-shrinkage for the Scorecard project. These are best understood by a numerical example from one of the simulated datasets. To understand this example, one should note that the individual random effects have the interpretation of (log-)odds ratios. Hence, the difference in these random effects, when exponentiated, yields the odds ratio of suffering a complication in the hands of a good relative to a bad surgeon. By comparing these random effects for good and bad surgeons who are equally good (or bad) relative to the mean (symmetric quantiles around the median), one can get an idea of the impact of using the predicted random effects to carry out individual comparisons.

Good     Bad     Quantile (Good)  Quantile (Bad)  True OR  Pred OR  Shrinkage Factor
-0.050   0.050   48.0             52.0            0.905    0.959    1.06
-0.100   0.100   46.0             54.0            0.819    0.920    1.12
-0.150   0.150   44.0             56.0            0.741    0.883    1.19
-0.200   0.200   42.1             57.9            0.670    0.847    1.26
-0.250   0.250   40.1             59.9            0.607    0.813    1.34
-0.300   0.300   38.2             61.8            0.549    0.780    1.42
-0.350   0.350   36.3             63.7            0.497    0.749    1.51
-0.400   0.400   34.5             65.5            0.449    0.719    1.60
-0.450   0.450   32.6             67.4            0.407    0.690    1.70
-0.500   0.500   30.9             69.1            0.368    0.662    1.80
-0.550   0.550   29.1             70.9            0.333    0.635    1.91
-0.600   0.600   27.4             72.6            0.301    0.609    2.02
-0.650   0.650   25.8             74.2            0.273    0.583    2.14
-0.700   0.700   24.2             75.8            0.247    0.558    2.26
-0.750   0.750   22.7             77.3            0.223    0.534    2.39
-0.800   0.800   21.2             78.8            0.202    0.511    2.53
-0.850   0.850   19.8             80.2            0.183    0.489    2.68
-0.900   0.900   18.4             81.6            0.165    0.467    2.83
-0.950   0.950   17.1             82.9            0.150    0.447    2.99
-1.000   1.000   15.9             84.1            0.135    0.427    3.15
-1.050   1.050   14.7             85.3            0.122    0.408    3.33
-1.100   1.100   13.6             86.4            0.111    0.390    3.52
-1.150   1.150   12.5             87.5            0.100    0.372    3.71
-1.200   1.200   11.5             88.5            0.091    0.356    3.92
-1.250   1.250   10.6             89.4            0.082    0.340    4.14
-1.300   1.300   9.7              90.3            0.074    0.325    4.37
-1.350   1.350   8.9              91.1            0.067    0.310    4.62
-1.400   1.400   8.1              91.9            0.061    0.297    4.88
-1.450   1.450   7.4              92.6            0.055    0.283    5.15
-1.500   1.500   6.7              93.3            0.050    0.271    5.44
-1.550   1.550   6.1              93.9            0.045    0.259    5.74
-1.600   1.600   5.5              94.5            0.041    0.247    6.07
-1.650   1.650   4.9              95.1            0.037    0.236    6.41
-1.700   1.700   4.5              95.5            0.033    0.226    6.77
-1.750   1.750   4.0              96.0            0.030    0.216    7.14
-1.800   1.800   3.6              96.4            0.027    0.206    7.55
-1.850   1.850   3.2              96.8            0.025    0.197    7.97
-1.900   1.900   2.9              97.1            0.022    0.188    8.42
-1.950   1.950   2.6              97.4            0.020    0.180    8.89
-2.000   2.000   2.3              97.7            0.018    0.172    9.39
-2.050   2.050   2.0              98.0            0.017    0.164    9.91

From this table it can be seen that the predicted odds ratios are always larger than the true ones. The ratio of these odds ratios (the shrinkage factor) grows larger as more extreme comparisons are contemplated.
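
The shrinkage factor column can be reproduced by simple arithmetic on the log-odds scale. A quick check for the row with true effects of ±0.5 follows; the predicted effects of roughly ±0.206 are back-calculated from the reported Pred OR, so they are an approximation rather than a value quoted in the post:

true_OR <- exp(-0.5 - 0.5)        # 0.368: true odds ratio of good vs bad surgeon
pred_OR <- exp(-0.206 - 0.206)    # 0.662: odds ratio implied by the shrunken predictions
pred_OR / true_OR                 # about 1.80, the shrinkage factor reported in the table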

In summary, the use of random effects models for the small cluster sizes (number of observations per surgeon) is likely to lead to estimates (or rather predictions) of individual effects that are smaller than their true values. Even though one should expect the differences to decrease with larger cluster sizes, this is unlikely to happen in real-world datasets (how often does one come across a surgeon who has performed thousands of operations of the same type before they retire?). Hence, the comparison of surgeon performance based on these random effect predictions is likely to be misleading due to over-shrinkage.

Where to go from here?

ProPublica should be congratulated for taking up such an ambitious and ultimately useful project. However, the limitations of the adopted approach should make one very skeptical about accepting the inferences from their modeling tool. In particular, the small number of observations per surgeon limits the utility of the predicted random effects to directly compare surgeons, due to over-shrinkage. Further studies are required before one could use the results of mixed effects modeling for this application. Based on some limited simulation experiments (that we do not present here), it seems that relative rankings of surgeons may be robust measures of surgical performance, at least compared to the absolute rates used by the Scorecard. Adding my voice to that of Dr Schloss, I think it is time for an open and transparent dialogue (and possibly a “crowdsourced” research project) to better define the best measure of surgical performance given the limitations of the available data. Such a project could also explore other directions, e.g. the explicit handling of zero inflation, and even go beyond the ubiquitous bell-shaped curve. By making the R code available, I hope that someone (possibly ProPublica) who can access more powerful computational resources can perform more extensive simulations. These may better define other aspects of the modeling approach and suggest improvements in the scorecard methodology. In the meantime, it is probably a good idea not to rely exclusively on the numerical measures of the scorecard when picking the surgeon who will perform your next surgery.

 

 

To leave a comment for the author, please follow the link and comment on his blog: Statistical Reflections of a Medical Doctor » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Empirical bias analysis of random effects predictions in linear and logistic mixed model regression

Thu, 2015-07-30 02:54

(This article was first published on Statistical Reflections of a Medical Doctor » R, and kindly contributed to R-bloggers)

In this first technical post in the series, I conduct a numerical investigation of the bias of random effect predictions in generalized linear mixed models (GLMMs), such as the ones used in the Surgeon Scorecard. I decided to undertake two explorations: firstly, the behavior of these estimates as more and more data are gathered for each individual surgeon, and secondly, whether the limiting behavior of these estimators critically depends on the underlying GLMM family. Note that the first question directly assesses whether the random effect estimators reflect the underlying (but unobserved) “true” value of the individual practitioner effect in logistic regression models for surgical complications. On the other hand, the second simulation examines a separate issue, namely whether the non-linearity of the logistic regression model affects the convergence rate of the random effect predictions towards their true value.

For these simulations we will examine three different ranges of dataset sizes for each surgeon:

  • small data (complication data from between 20-100 cases/surgeon are available)
  • large data (complications from between 200-1000 cases/surgeon)
  • extra large data (complications from between 1000-2000 cases/surgeon)

We simulated 200 surgeons (“random effects”) from a normal distribution with a mean of zero and a standard deviation of 0.26, while the population average complication rate was set to 5%. These numbers were chosen to reflect the range of values (average and population standard deviation) of the random effects in the Score Card dataset, while the use of 200 random effects was a realistic compromise with the computational capabilities of the Asus Transformer T100 2-in-1 laptop/tablet that I used for these analyses.

The following code was used to simulate the logistic case for small data (the large and extra large cases were simulated by changing the values of the Nmin and Nmax variables).

library(lme4)
library(mgcv)

## helper functions
logit <- function(x) log(x/(1-x))
invlogit <- function(x) exp(x)/(1+exp(x))

## simulate cases
simcase <- function(N, p) rbinom(N, 1, p)

## simulation scenario
pall <- 0.05      # global average
Nsurgeon <- 200   # number of surgeons
Nmin <- 20        # min number of surgeries per surgeon
Nmax <- 100       # max number of surgeries per surgeon

## simulate individual surgical performance
## how many simulations of each scenario
set.seed(123465)                 # reproducibility
ind <- rnorm(Nsurgeon, 0, .26)   # surgical random effects
logitind <- logit(pall) + ind    # convert to logits
pind <- invlogit(logitind)       # convert to probabilities
Nsim <- sample(Nmin:Nmax, Nsurgeon, replace=TRUE)   # number of cases per surgeon
complications <- data.frame(
  ev = do.call(c, mapply(simcase, Nsim, pind, SIMPLIFY=TRUE)),
  id = do.call(c, mapply(function(i, N) rep(i, N), 1:Nsurgeon, Nsim)))
complications$id <- factor(complications$id)

 

A random effect and fixed effect model were fit to these data (the fixed effect model is simply a series of independent fits to the data for each random effect):

## Random Effects
fit2 <- glmer(ev ~ 1 + (1|id), data=complications, family=binomial, nAGQ=2)
ran2 <- ranef(fit2)[["id"]][,1]
c2 <- cor(ran2, ind)
int2 <- fixef(fit2)
ranind2 <- ran2 + int2

## Fixed Effects
fixfit <- vector("numeric", Nsurgeon)
for(i in 1:Nsurgeon) {
  fixfit[i] <- glm(ev ~ 1, data=subset(complications, id==i), family="binomial")$coef[1]
}

The corresponding Gaussian GLMM cases were simulated by making minor changes to these codes. These are shown below:

simcase <- function(N, p) rnorm(N, p, 1)
fit2 <- glmer(ev ~ 1 + (1|id), data=complications, nAGQ=2)
fixfit[i] <- glm(ev ~ 1, data=subset(complications, id==i), family="gaussian")$coef[1]

The predicted random effects were assessed against the simulated truth by smoothing regression splines. In these regressions, the intercept yields the bias of the average of the predicted random effects vis-a-vis the truth, while the slope of the regression quantifies the amount of shrinkage effected by the mixed model formulation. For unbiased estimation not only would we like the intercept to be zero, but also the slope to be equal to one. In this case, the predicted random effect would be equal to its true (simulated) value. Excessive shrinkage would result in a slope that is substantially different from one. Assuming that the bias (intercept) is not different from zero, the relaxation of the slope towards one quantifies the consistency and the bias (or rather its rate of convergence) of these estimators using simulation techniques (or so it seems to me).

The use of smoothing (flexible), rather than simple linear regression, to quantify these relationships does away with a restrictive assumption: that the amount of shrinkage is the same throughout the range of the random effects:

## smoothing spline (flexible) fit
fitg <- gam(ran2 ~ s(ind))
## linear regression
fitl <- lm(ran2 ~ ind)

The following figure shows the results of the flexible regression (black, with 95% CI as dashed black) vs. the linear regression (red) and the expected (blue) line (intercept of zero, slope of one).

Predicted vs. simulated random effects for logistic and linear mixed regression as a function of the number of observations per random effect (cluster size)

Several observations are worth noting in this figure.
First, the flexible regression was indistinguishable from a linear regression in all cases; hence the red and black lines overlap. Stated in other terms, the amount of shrinkage was the same across the range of the random effect values.
Second, the intercept in all flexible models was (within machine precision) equal to zero. Consequently, when estimating a group of random effects their overall mean is (unbiasedly) estimated.
Third, the amount of shrinkage of individual random effects appears to be excessive for small sample sizes (i.e. few cases per surgeon). Increasing the number of cases decreases the shrinkage, i.e. the black and red lines come closer to the blue line as N is increased from 20-100 to 1000-2000. Conversely, for small cluster sizes the amount of shrinkage is so excessive that one may lose the ability to distinguish between individuals with very different complication rates. This is reflected by a regression line between the predicted and the simulated random effect value that is nearly horizontal.
Fourth, the rate of convergence of the predicted random effects to their true value critically depends upon the linearity of the regression model. In particular, the shrinkage of the logistic regression model with 1000-2000 observations per cluster is almost the same as that of a linear model with 20-100, for the parameter values considered in this simulation.

An interesting question is whether these observations (overshrinkage of random effects with small sample sizes in logistic mixed regression) reflect the use of random effects in modeling, or whether they are simply due to the interplay between sample size and the non-linearity of the statistical model. Hence, I turned to fixed effects modeling of the same datasets. The results of these analyses are summarized in the following figure:

Difference between fixed effect estimates of random effects (black histograms) vs. random effect predictions (density estimators: red lines) relative to their simulated (true) values

One notes that the distribution of the differences between the random and fixed effects relative to the true (simulated) values is nearly identical for the linear case (second row). In other words, the use of the implicit constraint of the mixed model offers no additional advantage when estimating individual performance in this model. On the other hand, there is value in applying mixed modeling techniques to the logistic regression case. In particular, outliers (such as those arising from small samples) are eliminated by the use of random effect modeling. The difference between the fixed and the random effect approach progressively decreases for larger sample sizes, implying that the benefit of the latter approach is lost for “extra large” cluster sizes.

One way to put these differences into perspective is to realize that the random effects for the logistic model correspond to log-odds ratios, relative to the population mean. Hence the difference between the predicted random effect and its true value, when exponentiated, corresponds to an Odds Ratio (OR). A summary of the odds ratios over the population of the random effects as a function of cluster size is shown below.

Metric   20-100   200-1000   1000-2000
Min      0.5082   0.6665     0.7938
Q25      0.8901   0.9323     0.9536
Median   1.0330   1.0420     1.0190
Mean     1.0530   1.0410     1.0300
Q75      1.1740   1.1340     1.1000
Max      1.8390   1.5910     1.3160
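
For a single simulated scenario, such a summary can be computed directly from the objects defined in the code above (a sketch, not necessarily the author's exact code):

## ran2: predicted random effects from glmer; ind: simulated (true) effects
OR <- exp(ran2 - ind)   # odds ratio of the predicted vs the true individual effect
summary(OR)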

 

Even though the average Odds Ratio is close to 1, a substantial number of predicted random effects are far from the true value and yield ORs that deviate from one by more than 11% in either direction for small cluster sizes. These observations have implications for the Score Card (or similar projects): if one were to use random effects modeling to focus on individuals, then unless the cluster sizes (observations per individual) are substantial, one would run a substantial risk of misclassifying individuals, even though one would be right on average!

One could wonder whether these differences between the simulated truth and the predicted random effects arise as a result of the numerical algorithms of the lme4 package. The latter was used by both the Surgeon Score Card project and our simulations so far, and thus it would be important to verify that it performs up to spec. The major tuning variable for the algorithm is the order of the Adaptive Gaussian Quadrature (argument nAGQ). We did not find any substantial departures when the order of the quadrature was varied from 0 to 1 and 2. However, there is a possibility that the algorithm fails for all AGQ orders as it has to calculate probabilities that are numerically close to the boundary of the parameter space. We thus decided to fit the same model from a Bayesian perspective using Markov Chain Monte Carlo (MCMC) methods. The following code fits the Bayesian model and graphs the true values of the effects used in the simulated dataset against the Bayesian estimates (the posterior means) and also against the lme4 predictions. The latter tend to be numerically close to the posterior mode of the random effects when a Bayesian perspective is adopted.

## Fit the mixed effects logistic model from R using openbugs
library("glmmBUGS")
library(nlme)
fitBUGS = glmmBUGS(ev ~ 1, data=complications, effects="id", family="bernoulli")
startingValues = fitBUGS$startingValues
source("getInits.R")
require(R2WinBUGS)
fitBUGSResult = bugs(fitBUGS$ragged, getInits,
                     parameters.to.save = names(getInits()),
                     model.file="model.txt", n.chain=3, n.iter=6000,
                     n.burnin=2000, n.thin=10, program="openbugs",
                     working.directory=getwd())
fitBUGSParams = restoreParams(fitBUGSResult, fitBUGS$ragged)
sumBUGS <- summaryChain(fitBUGSParams)
checkChain(fitBUGSParams)

## extract random effects
cnames <- as.character(sort(as.numeric(row.names(sumBUGS$FittedRid))))
fitBUGSRE <- sumBUGS$Rid[cnames, 1]

## plot against the simulated (true) effects and the lme4 estimates
hist(ind, xlab="RE", ylim=c(0, 3.8), freq=FALSE, main="")
lines(density(fitBUGSRE), main="Bayesian", xlab="RE", col="blue")
lines(density(ran2), col="red")
legend(legend=c("Truth", "lme4", "MCMC"), col=c("black", "red", "blue"),
       bty="n", x=0.2, y=3, lwd=1)

The following figure shows the histogram of the true values of the random effects (black), the frequentist(lme4) estimates (red) and the Bayesian posterior means (blue).

 

It can be appreciated that both the Bayesian estimates and the lme4 predictions demonstrate considerable shrinkage relative to the true values for small cluster sizes (20-100). Hence, an lme4 numerical quirk seems an unlikely explanation for the shrinkage observed in the simulation.

Summing up:

  • Random Effect modeling of binomial outcomes requires a substantial number of observations per individual (cluster size) for the procedure to yield estimates of individual effects that are numerically indistinguishable from the true values
  • Fixed effect modeling is an even worse approach for this problem
  • Bayesian fitting procedures do not appear to yield numerically different effects from their frequentist counterparts

These features should raise the barrier for accepting a random effects logistic modeling approach when the focus is on individual rather than population average effects. Even though the procedure is certainly preferable to fixed effects regression, the direct use of the value of the predicted individual random effects as an effect measure will be problematic for small cluster sizes (e.g. a small number of procedures per surgeon). In particular, a substantial proportion of these estimated effects is likely to be far from the truth even if the model is unbiased on average. These observations are of direct relevance to the Surgical Score Card, in which the number of observations per surgeon was far lower than the average values in our simulations: 60 (small), 600 (large) and 1500 (extra large):

Procedure Code   N (procedures)   N (surgeons)   Procedures per surgeon
51.23            201,351          21,479         9.37
60.5             78,763           5,093          15.46
60.29            73,752           7,898          9.34
81.02            52,972           5,624          9.42
81.07            106,689          6,214          17.17
81.08            102,716          6,136          16.74
81.51            494,576          13,414         36.87
81.54            1,190,631        18,029         66.04
Total            2,301,450        83,887         27.44

To leave a comment for the author, please follow the link and comment on his blog: Statistical Reflections of a Medical Doctor » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

#rstats Make arrays into vectors before running table

Wed, 2015-07-29 22:15

(This article was first published on A HopStat and Jump Away » Rbloggers, and kindly contributed to R-bloggers)

Setup of Problem

While working with nifti objects from the oro.nifti package, I tried to table the values of the image. The table took a long time to compute. I thought this was due to the added information about a medical image, but I found that the same sluggishness happened when coercing the nifti object to an array as well.

Quick, illustrative simulation

But, if I coerced the data to a vector using the c function, things were much faster. Here's a simple example of the problem.

library(microbenchmark)
dim1 = 30
n = dim1 ^ 3
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = c(dim1, dim1, dim1))
microbenchmark(table(vec), table(arr), table(c(arr)), times = 100)

Unit: milliseconds
          expr       min        lq      mean    median        uq      max neval cld
    table(vec)  5.767608  5.977569  8.052919  6.404160  7.574409 51.13589   100  a
    table(arr) 21.780273 23.515651 25.050044 24.367534 25.753732 68.91016   100   b
 table(c(arr))  5.803281  6.070403  6.829207  6.786833  7.374568  9.69886   100  a

As you can see, it's much faster to run table on the vector than on the array, and coercing the array to a vector adds little overhead, so table(c(arr)) is comparable in speed to table(vec).

Explanation of simulation

If the code above is clear, you can skip this section. I created a 30 × 30 × 30 array of random binomial variables with probability 0.5 and 15 Bernoulli trials. To keep things on the same playing field, the array (arr) and the vector (vec) have the same values in them. The microbenchmark function (and package of the same name) runs each command 100 times and displays the statistics of the timings.

Why, oh why?

I've looked into the table function, but cannot seem to find where the bottleneck occurs. Now, for an array of 30 × 30 × 30, it takes less than a tenth of a second to compute. The problem is that when the data is 512 × 512 × 30 (such as CT data), the tabulation using the array form can be very time consuming.

I reduced the replicates, but let's see this in a reasonable image dimension example:

library(microbenchmark)
dims = c(512, 512, 30)
n = prod(dims)
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = dims)
microbenchmark(table(vec), table(arr), table(c(arr)), times = 10)

Unit: seconds
          expr      min       lq     mean    median        uq       max neval cld
    table(vec) 1.871762 1.898383 1.990402  1.950302  1.990898  2.299721    10  a
    table(arr) 8.935822 9.355209 9.990732 10.078947 10.449311 11.594772    10   b
 table(c(arr)) 1.925444 1.981403 2.127866  2.018741  2.222639  2.612065    10  a

Conclusion

I can't figure out why right now, but it seems that coercing an array (or nifti image) to a vector before running table can significantly speed up the procedure. If anyone has any intuition why this is, I'd love to hear it. Hope that helps your array tabulations!
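
A trivial wrapper (not from the original post) that bakes the workaround in:

tablev <- function(x) table(c(x))   # coerce to a plain vector before tabulating
tablev(arr)                         # same counts as table(arr), but much faster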

To leave a comment for the author, please follow the link and comment on his blog: A HopStat and Jump Away » Rbloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R Oddities: Strings in DataFrames

Wed, 2015-07-29 21:31

(This article was first published on R-Chart, and kindly contributed to R-bloggers)

Have you ever read a file into R and then encountered strange problems filtering and sorting because the strings were converted to factors? For instance, you might think the two data frames df and df2 below contain the same data

> df <- data.frame(V1 = c("hello world"), stringsAsFactors = FALSE)
> write.csv(df, 'df.csv')
> df2 <- read.csv('df.csv')

But look, the dimensions are different

> dim(df)
[1] 1 1
> dim(df2)
[1] 1 2

And on further analysis – they are indeed different classes:

> class(df$V1)
[1] "character"
> class(df2$V1)
[1] "factor"
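
As an aside that is not in the original post: the surprise can also be avoided at read time by passing stringsAsFactors = FALSE to read.csv (a minimal sketch; df3 is just an illustrative name):

> df3 <- read.csv('df.csv', stringsAsFactors = FALSE)
> class(df3$V1)
[1] "character"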

This behavior is very confusing to many when first introduced to R.  Over time, I accepted it but didn’t really understand how or why it originated.   Roger Peng over at the Simply Statistics blog has a great write up on why R operates this way:

http://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/

To leave a comment for the author, please follow the link and comment on his blog: R-Chart. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

But I Don’t Want to Be a Statistician!

Wed, 2015-07-29 19:42

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

“For a long time I have thought I was a statistician…. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…. All in all, I have come to feel that my central interest is in data analysis….”

Opening paragraph from John Tukey, “The Future of Data Analysis” (1962)

To begin, we must acknowledge that these labels are largely administrative based on who signs your paycheck. Still, I prefer the name “data analysis” with its active connotation. I understand the desire to rebrand data analysis as “data science” given the availability of so much digital information. As data has become big, it has become the star and the center of attention.

We can borrow from Breiman’s two cultures of statistical modeling to clarify the changing focus. If our data collection is directed by a generative model, we are members of an established data modeling community and might call ourselves statisticians. On the other hand, the algorithmic modeler (although originally considered a deviant but now rich and sexy) took whatever data was available and made black box predictions. If you need a guide to applied predictive modeling in R, Max Kuhn might be a good place to start.

Nevertheless, causation keeps sneaking in through the back door in the form of causal networks. As an example, choice modeling can be justified as an “as if” predictive modeling but then it cannot be used for product design or pricing. As Judea Pearl notes, most data analysis is “not associational but causal in nature.”

Does an inductive bias or schema predispose us to see the world as divided into causes and effects, with features creating preference and preference impacting choice? Technically, the hierarchical Bayes choice model does not require the experimental manipulation of feature levels, for example, reporting the likelihood of bus ridership for individuals with differing demographics. Even here, it is difficult not to see causation at work, with demographics becoming stereotypes. We want to be able to turn the dial, or at least select different individuals, and watch choices change. Are such cognitive tendencies part of statistics?

Moreover, data visualization has always been an integral component in the R statistical programming language. Is data visualization statistics? And what of presentations like Hans Rosling’s Let My Dataset Change Your Mindset? Does statistics include argumentation and persuasion?

Hadley Wickham and the Cognitive Interpretation of Data Analysis

You have seen all of his data manipulation packages in R, but you may have missed the theoretical foundations in the paper “A Cognitive Interpretation of Data Analysis” by Grolemund and Wickham. Sensemaking is offered as an organizing force with data analysis as an external tool to aid understanding. We can make sensemaking less vague with an illustration.

Perceptual maps are graphical displays of a data matrix, such as the one below from an earlier post showing the association between 14 European car models and 27 attributes. Our familiarity with Euclidean spaces aids in the interpretation of the 14 x 27 association table. It summarizes the data using a picture and enables us to speak of repositioning car models. The joint plot can be seen as the competitive landscape, and soon the language of marketing warfare brings this simple 14 x 27 table to life. Where is the high ground or an opening for a new entry? How can we guard against an attack from below? This is sensemaking, but is it statistics?

I consider myself to be a marketing researcher, though with a PhD, I get more work calling myself a marketing scientist. I am a data analyst and not a statistician, yet in casual conversation I might say that I am a statistician in the hope that the label provides some information. It seldom does.

I deal in sensemaking. First, I attempt to understand how consumers make sense of products and decide what to buy. Then, I try to represent what I have learned in a form that assists in strategic marketing. My audience has no training in research or mathematics. Statistics plays a role and R helps, but I never wanted to be a statistician. Not that there is anything wrong with that.

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Mapping the past and the future with Leaflet

Wed, 2015-07-29 19:21

(This article was first published on 56north | Skræddersyet dataanalyse » Renglish, and kindly contributed to R-bloggers)

I have been working on mapping things for a while and I must say that I really like the Leaflet package from RStudio. It makes it very easy and straightforward to make Leaflet maps.

A while back I stumbled upon an interactive graphic from The Times that used census data to compare each US state with the country as a whole, showing whether that state resembles the US of the past, present or future (go to the link if this sounds confusing).

So I decided I wanted to do the same with danish data. Statistics Denmark has three relevant data sets for creating such a map:

  1. FRDK115: Population projections 2015 for the country by ancestry, sex and age
  2. FOLK2: Population 1. January by sex, age, ancestry, country of origin and citizenship
  3. FOLK1: Population at the first day of the quarter by municipality, sex, age, marital status, ancestry, country of origin and citizenship

FRDK115 show us the future of Denmark, FOLK2 shows us the past and FOLK1 show us the present.

In this data I can see age and ancestry, so I am able to make two kinds of maps: one that shows the present mean age of each Danish municipality and compares it with the country as a whole from 1980 to 2050, and one that shows the present ancestry distribution in each Danish municipality and compares it with the country as a whole as well.

I colored the map to show which year each municipality is indicative of, where yellow indicates the past and red the future. Each municipality got a little pop-up box that, when you click it, explains why it is colored the way it is.

So two patterns emerge:

  1. Urban Denmark looks like Denmark in the past when it comes to age
  2. Urban Denmark looks like Denmark in the future when it comes to ancestry

If we want to see what kind of challenges Denmark will face in the future regarding an aging population, we can look to some of the deep red municipalities in the outskirts of Denmark on the age map, whereas if we want to see what kind of challenges Denmark will face in the future regarding a more diverse ancestry, we can look to some of the deep red municipalities in and around the capital area on the ancestry map.

Here is an example:

You can see the full interactive leaflet maps here:

The Past and Future of Age in Denmark

The Past and Future of Ancestry in Denmark

 

Now go make your very own cool maps! There is a guide on R-bloggers that gives a good introduction to the Leaflet package.
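
To get a feel for the basic ingredients (tiles, colored markers, pop-ups and a legend), here is a minimal sketch. It is not the author's code, and the municipality names, coordinates and "indicative years" below are invented for illustration:

library(leaflet)

munis <- data.frame(
  name = c("Municipality A", "Municipality B", "Municipality C"),
  lat  = c(55.68, 56.16, 54.77),   # rough, illustrative coordinates
  lng  = c(12.57, 10.20, 11.30),
  year = c(1995, 2005, 2040)       # hypothetical "indicative year"
)

pal <- colorNumeric(c("yellow", "red"), domain = munis$year)

leaflet(munis) %>%
  addTiles() %>%
  addCircleMarkers(~lng, ~lat, color = ~pal(year),
                   popup = ~paste0(name, ": looks like Denmark in ", year)) %>%
  addLegend(pal = pal, values = ~year, title = "Indicative year")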

To leave a comment for the author, please follow the link and comment on his blog: 56north | Skræddersyet dataanalyse » Renglish. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Player Value Gap Assessment

Wed, 2015-07-29 18:42

(This article was first published on Fantasy Football Analytics » R | Fantasy Football Analytics, and kindly contributed to R-bloggers)

Looking at fantasy football projections we have a group of experts providing their views on how a player will do during the football season. We have collected projections from several sources and are able to analyze them in detail.

Football is a balanced game in the sense that all passing yards should be matched by receiving yards, passing touchdowns should match receiving touchdowns, and pass completions should match pass receptions. What information can the projections offer us about value gaps if we examine these stat categories? If a team's passing yards are projected to be higher than its receiving yards, you can ask yourself: are the QB's passing yards higher than they should be, or are the receivers' reception yards lower than they should be? In either case it may provide some room to explore value gaps, as sketched below. Using historical data we can look at the accuracy of these stat categories and see if either of them is under- or over-estimated.
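
A hedged sketch of how such team-level gaps could be computed with dplyr, assuming a projections data frame with one row per player and hypothetical column names (team, passYds, recYds, passTds, recTds, passComp, receptions):

library(dplyr)

gaps <- proj %>%                 # 'proj' is the hypothetical projections data frame
  group_by(team) %>%
  summarise(
    yds_gap  = sum(passYds,  na.rm = TRUE) - sum(recYds,     na.rm = TRUE),
    td_gap   = sum(passTds,  na.rm = TRUE) - sum(recTds,     na.rm = TRUE),
    comp_gap = sum(passComp, na.rm = TRUE) - sum(receptions, na.rm = TRUE)
  ) %>%
  arrange(desc(yds_gap))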

Passing and receiving yards

Let’s take a look at the data for the 2015 season. We have data from 15 different sources and if we add up all the passing yards projected for the QBs for a team and compare those to the sum of projected receiving yards for RBs, WRs, and TEs the picture below emerges:

At one end we have New Orleans with on average 453 passing yards more than receiving yards; at the other end is Chicago with on average 314 passing yards less than receiving yards. If we take the two data points at each end we could argue two different things for each:

  • Drew Brees is over-rated. In a standard league setup with 1 point per 25 yards we are looking at taking about 18 points off his projections.
  • His receivers are under-rated. Brandin Cooks accounts for about 22% of the projected reception yards, so to get his share of the gap he would have almost 100 yards more which would translate into another 10 points for him.
  • Jay Cutler is under-rated. If Jay Cutler can pass for another 314 yards, his point projections are about 12 points short. Now you could argue that Jay Cutler could be benched halfway through the season, but the number represents the team total of passing yards, so it includes all QBs on the team.
  • Again, if Jay Cutler is not under-rated then his receivers are over-rated. Alshon Jeffrey accounts for about 27% of the projected receiving yards so he could take an 85 yard hit on the projections or about 8–9 points.

In general, most teams are projected with more passing yards than receiving yards; however, projected passing yards are only about 34 yards more than projected receiving yards, on average.

Passing and receiving touchdowns

Touchdowns should also be balanced out so passing touchdowns are equal to receiving touchdowns, but as the graph below shows, passing touchdowns are projected to be higher than receiving touchdowns.

Again New Orleans is to be found on the higher end with 4.57 more passing touchdowns than receiving touchdowns, only surpassed by Denver with 4.60 more passing touchdowns than receiving touchdowns. If a standard 4 points is awarded for passing touchdowns then Drew Brees’s projections could be inflated by another 18 points which along with the 18 points from above then puts him at risk to lose 36 points total. That could mean a significant drop in positional ranking.

On the receiver side for New Orleans, Brandin Cooks accounts for about 20% of the projected receiving touchdowns, so he could see an extra touchdown which could put him up another 6 points or a total of 16 points if you count the receiving yards as well.

Pass completions and receptions

The final indicator of value gaps comes from comparing pass completions to pass receptions. The general picture is that pass completions are always projected lower than pass receptions, but in this case New Orleans is pretty balanced, with just 2 more pass receptions than completions.

This set of data makes the assessment complete in the sense that you will have to look at all three stats to see what value gaps there may be. For New Orleans the number of completed passes almost matches the number of receptions by offensive players, but passing yards and touchdowns are higher than receiving yards and touchdowns. So yards per completion is too high for Drew Brees and/or yards per reception is too low for his receivers. This is where you can try and make an assessment for yourself. If you think it is a little bit of both, then you can split the 453 yards and take some away from Brees and give it to the receivers. Same thing for the touchdowns – maybe split them 50/50 and take 2 away from Brees and give 2 to the receivers.

Conclusion

Although experts work hard to provide projections for football players, they do not always match up when looking at the data at the team level. For the 2015 projected stats we have seen that:

  • Receiving yards are mostly projected higher than passing yards, but on average by only about 34 yards
  • Passing touchdowns are almost always projected higher than receiving touchdowns, on average 1.5 passing touchdowns more than receiving touchdowns
  • Pass completions are always projected lower than pass receptions, on average 21.4 fewer completions than receptions

Looking at the individual teams you can explore these differences to find potential value gaps. However there is not a definite solution to close the value gap. Historical data can provide some insight into the accuracy of the stats and give an indication on the direction to take on closing this gap. If we go back and look at the data from 2008 through 2014 we find that both receiving and passing yards were over-estimated over that time period, receiving yards by 387 yards on average and passing yards by 283 yards on average compared to the actual values. So passing yards are slightly more accurate than receiving yards but both projections should likely be adjusted downward some. Projected passing and receiving touchdowns seem historically pretty accurate. In the time period from 2008 to 2014, both projected passing and receiving touchdowns are only a couple of touchdowns over the actual values. On the other hand, pass receptions and pass completions were both under-estimated in the same time period. Projected pass completions were on average 42 completions under the actual number of completions and pass receptions were on average 33 receptions under the actual number of receptions. Overall it seems that projected passing data is more accurate than receiving data and could point us to making larger adjustments of receiving projections than passing projections. So in reference to Drew Brees: looking at the historical data in general it does not provide a clear way of resolving the value gap, but for this season it could look like it is more likely that Brees is over-rated than it is that his wide receivers are under-rated. We will be dissecting the historical accuracy of the data in future posts, so stay tuned for that.

You can find the R script used for the analysis above here: https://github.com/dadrivr/FantasyFootballAnalyticsR/tree/master/R%20Scripts/Posts/Value%20Gap%20analysis/R%20Scripts

The post Player Value Gap Assessment appeared first on Fantasy Football Analytics.

To leave a comment for the author, please follow the link and comment on his blog: Fantasy Football Analytics » R | Fantasy Football Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Predict Social Network Influence with R and H2O Ensemble Learning

Wed, 2015-07-29 15:39

(This article was first published on ThinkToStart » R Tutorials, and kindly contributed to R-bloggers)

What is H2O?

H2O is an awesome machine learning framework. It is really great for data scientists and business analysts "who need scalable and fast machine learning". H2O is completely open source and, importantly, it works right out of the box. There seems to be no easier way to start with scalable machine learning. It has support for R, Python, Scala and Java, and it also offers a REST API and its own web UI. So you can use it for research as well as in production environments.

H2O can run on top of Apache Hadoop and Apache Spark, which gives it enormous power through in-memory parallel processing.

H2O is the #1 Java-based open-source Machine Learning project on GitHub and is used by really a lot of well-known companies like PayPal.

Get the data

The data we will use to train our machine learning model is from a Kaggle competition from the year 2013. The challenge was organized by Data Science London and the UK Windows Azure Users Group who partnered with Microsoft and Peerindex.

The data was described as: “The dataset, provided by Peerindex, comprises a standard, pair-wise preference learning task. Each datapoint describes two individuals, A and B. For each person, 11 pre-computed, non-negative numeric features based on twitter activity (such as volume of interactions, number of followers, etc) are provided.

The binary label represents a human judgement about which one of the two individuals is more influential. A label ‘1’ means A is more influential than B. 0 means B is more influential than A. The goal of the challenge is to train a machine learning model which, for pairs of individuals, predicts the human judgement on who is more influential with high accuracy. Labels for the dataset have been collected by PeerIndex using an application similar to the one described in this post.”

Theory to build the ensemble model

With the h2oEnsemble package we can create ensembles with all available machine learning models in H2O. As the package authors explain: “This type of ensemble learning is called “super learning”, “stacked regression” or “stacking.” The Super Learner algorithm learns the optimal combination of the base learner fits.”

This is based on the article "Super Learner", published in the journal "Statistical Applications in Genetics and Molecular Biology" by Mark J. van der Laan, Eric C. Polley and Alan E. Hubbard from the University of California, Berkeley in 2007. They also created the "SuperLearner" package based on it: SuperLearner

Install the dependencies

First of all we need the h2o package. This is available on CRAN, as are the SuperLearner and cvAUC packages that we also need to build our model.

install.packages(c("h2o","SuperLearner","cvAUC"))

The other and maybe most important package is the h2oEnsemble package. We can download it via devtools from the h2o GitHub repository.

require(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")

Then we just have to load all the packages:

library(h2o)
library(h2oEnsemble)
library(SuperLearner)
library(cvAUC)

 

Preparing the data

First we have to load the dataset with

dat<-read.csv("train.csv", header=TRUE)

and split it into a train and a test data set. The test dataset provided by the Kaggle challenge does not include output labels so we can not use it to test our machine learning model.

We split it by creating a train index that chooses 4000 line numbers from the data frame. We then subset it into train and test:

train_idx <- sample(1:nrow(dat), 4000, replace=FALSE)
train <- dat[train_idx, ]    # select all these rows
test  <- dat[-train_idx, ]   # select all but these rows

Preparing the ensemble model

We begin by starting an H2O instance directly from our R console:

localH2O = h2o.init()

 

After starting the H2O instance, we have to prepare our data set a little further: transform it into H2O data objects, define the output column, and set the family variable to binomial, as we want to do a binomial classification.

That is also the reason why we have to turn the output columns into factor variables.

training_frame <- as.h2o(localH2O, train)
validation_frame <- as.h2o(localH2O, test)
y <- "Choice"
x <- setdiff(names(training_frame), y)
family <- "binomial"
training_frame[, c(y)] <- as.factor(training_frame[, c(y)])       # Force binary classification
validation_frame[, c(y)] <- as.factor(validation_frame[, c(y)])   # check to validate that this guarantees the same 0/1 mapping?

Then we can define the machine learning methods we want to include in our ensemble and set the metalearner from the SuperLearner package; in this case we will use a generalized linear model.

# Specify the base learner library & the metalearner
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "SL.glm"

Build the model to predict Social Network Influence

After defining all the parameters we can build our model.

fit <- h2o.ensemble(x = x, y = y,
                    training_frame = training_frame,
                    family = family,
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5, shuffle = TRUE))


 

Evaluate the performance

Evaluating the performance of a H2O model is pretty easy. But as the ensemble method is not fully implemented in the H2O package yet, we have to create a small workaround to get the accuracy of our model.

pred <- predict.h2o.ensemble(fit, validation_frame)
labels <- as.data.frame(validation_frame[, c(y)])[, 1]

# Ensemble test AUC
AUC(predictions = as.data.frame(pred$pred)[, 1], labels = labels)

In my case the AUC was nearly 0.88, which is a pretty good result.

You can also look at the AUC of the single models you used in your ensemble:

L <- length(learner)
sapply(seq(L), function(l) AUC(predictions = as.data.frame(pred$basepred)[, l], labels = labels))

 

The post Predict Social Network Influence with R and H2O Ensemble Learning appeared first on ThinkToStart.


The most popular programming languages on StackOverflow

Wed, 2015-07-29 11:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

Last week, IEEE Spectrum reported that R rose to #6 in its Top Programming Languages ranking. They use a weighted methodology of 12 factors to compute their score. Among these factors is the activity on social programming websites, including StackOverflow and GitHub.

I recently used data.stackexchange.com to query the total number of questions on StackOverflow using the R tag.

It is easy to extend the query to include all of the top 10 languages (according to IEEE Spectrum) and see if the StackOverflow activity tells us anything interesting.

This plot shows the monthly total number of questions in each tag on StackOverflow since 2008. JavaScript and Java top the list, with R in position #8, roughly on par with C.

 

There is a fair amount of noise in the form of monthly fluctuations. The underlying trends are slightly easier to observe by fitting a smoother through the data:

From this it is apparent that R is climbing in rank relative to the other languages. During the past year it has been catching up to C and should soon overtake it to take position #7. By comparison, during most of 2013 R was only in position #9.

Analysis

The IEEE Spectrum study uses many other factors, including search rankings and job postings. For R to jump to #6 in their study, these other factors must also play a role.

Nevertheless, it is interesting to compare the numbers on a single, very easily accessible metric.

Caveats

I didn't try to include any languages beyond IEEE Spectrum's top 10 in the StackExchange data query. The overall ranking in my analysis might change with a more comprehensive list of tags to search for. (You can change the underlying query and add more tag names.)

The code

Here is the code to make these plots:
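
A rough sketch of that query-and-plot workflow, assuming the data.stackexchange.com results have been exported to a CSV named stackoverflow_tags.csv with columns tag, month and n (file and column names assumed for illustration):

library(ggplot2)

# Assumed export from data.stackexchange.com:
# one row per (tag, month) with the number of questions asked
so <- read.csv("stackoverflow_tags.csv", stringsAsFactors = FALSE)
so$month <- as.Date(so$month)

# Monthly totals per tag
ggplot(so, aes(x = month, y = n, colour = tag)) +
  geom_line() +
  labs(x = NULL, y = "Questions per month",
       title = "StackOverflow questions by tag")

# Underlying trends: replace the raw lines with a smoother
ggplot(so, aes(x = month, y = n, colour = tag)) +
  geom_smooth(se = FALSE) +
  labs(x = NULL, y = "Questions per month (smoothed)")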

 


Introducing the nominatim geocoding package

Wed, 2015-07-29 10:20

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

In the never-ending battle for truth, justice and publishing more R packages than Oliver, I whipped out an R package for the OpenStreetMap Nominatim API. It actually hits the MapQuest Nominatim servers for most of the calls, but the functionality is the same.

The R package lets you:

  • address_lookup: Lookup the address of one or multiple OSM objects like node, way or relation.
  • osm_geocode: Search for places by address
  • osm_search: Search for places
  • osm_search_spatial: Search for places, returning a list of SpatialPointsDataFrame, SpatialLinesDataFrame or SpatialPolygonsDataFrame objects
  • reverse_geocode_coords: Reverse geocode based on lat/lon
  • reverse_geocode_osm: Reverse geocode based on OSM Type & Id

Just like Google Maps, these services are not meant to be your freebie-access to mega-bulk-geocoding. You can and should pay for that. But when you need a few items geocoded (or want to look up some interesting things on OSM, since it provides special phrases to work with), Nominatim lookups can be just what's needed.

Let’s say we wanted to see where pubs are in the Seattle metro area.
That’s a simple task for nominatim:

# devtools::install_github("hrbrmstr/nominatim") library(nominatim) library(dplyr)   sea_pubs <- osm_search("pubs near seattle, wa", limit=20)   glimpse(sea_pubs)   ## Observations: 20 ## Variables: ## $ place_id (chr) "70336054", "82743439", "11272568", "21478701", "... ## $ licence (chr) "Data © OpenStreetMap contributors, ODbL 1.0. htt... ## $ osm_type (chr) "way", "way", "node", "node", "node", "node", "no... ## $ osm_id (chr) "51460516", "96677583", "1077652159", "2123245933... ## $ lat (dbl) 47.64664, 47.63983, 47.60210, 47.62438, 47.59203,... ## $ lon (dbl) -122.3503, -122.3023, -122.3321, -122.3559, -122.... ## $ display_name (chr) "Nickerson Street Saloon, 318, Nickerson Street, ... ## $ class (chr) "amenity", "amenity", "amenity", "amenity", "amen... ## $ type (chr) "pub", "pub", "pub", "pub", "pub", "pub", "pub", ... ## $ importance (dbl) 0.201, 0.201, 0.201, 0.201, 0.201, 0.201, 0.201, ... ## $ icon (chr) "http://mq-open-search-int-ls03.ihost.aol.com:800... ## $ bbox_left (dbl) 47.64650, 47.63976, 47.60210, 47.62438, 47.59203,... ## $ bbox_top (dbl) 47.64671, 47.63990, 47.60210, 47.62438, 47.59203,... ## $ bbox_right (dbl) -122.3504, -122.3025, -122.3321, -122.3559, -122.... ## $ bbox_bottom (dbl) -122.3502, -122.3022, -122.3321, -122.3559, -122....

We can even plot those locations:

library(rgdal)
library(ggplot2)
library(ggthemes)
library(sp)
library(DT)

# Grab a neighborhood map of Seattle
url <- "https://data.seattle.gov/api/file_data/VkU4Er5ow6mlI0loFhjIw6eL6eKEYMefYMm4MGcUakU?filename=Neighborhoods.zip"
fil <- "seattle.zip"
if (!file.exists(fil)) download.file(url, fil)
if (!dir.exists("seattle")) unzip(fil, exdir="seattle")

# make it usable
sea <- readOGR("seattle/Neighborhoods/WGS84/Neighborhoods.shp", "Neighborhoods")

## OGR data source with driver: ESRI Shapefile
## Source: "seattle/Neighborhoods/WGS84/Neighborhoods.shp", layer: "Neighborhoods"
## with 119 features
## It has 12 fields

sea_map <- fortify(sea)

# Get the extents of where the pubs are so we can "zoom in"
bnd_box <- bbox(SpatialPoints(as.matrix(sea_pubs[, c("lon", "lat")])))

# plot them
gg <- ggplot()
gg <- gg + geom_map(data=sea_map, map=sea_map,
                    aes(x=long, y=lat, map_id=id),
                    color="black", fill="#c0c0c0", size=0.25)
gg <- gg + geom_point(data=sea_pubs, aes(x=lon, y=lat),
                      color="#ffff33", fill="#ff7f00",
                      shape=21, size=4, alpha=1/2)
# decent projection for Seattle-y things and expand the zoom/clip a bit
gg <- gg + coord_map("gilbert",
                     xlim=extendrange(bnd_box["lon",], f=0.5),
                     ylim=extendrange(bnd_box["lat",], f=0.5))
gg <- gg + labs(title="Seattle Pubs")
gg <- gg + theme_map()
gg <- gg + theme(title=element_text(size=16))
gg

Of course you can geocode:

addrs <- osm_geocode(c("1600 Pennsylvania Ave, Washington, DC.", "1600 Amphitheatre Parkway, Mountain View, CA", "Seattle, Washington")) addrs %>% select(display_name)   ## Source: local data frame [3 x 1] ## ## display_name ## 1 Washington, District of Columbia, United States of America ## 2 Mountainview Lane, Huntington Beach, Orange County, California, 92648, Unit ## 3 Seattle, King County, Washington, United States of America   addrs %>% select(lat, lon)   ## Source: local data frame [3 x 2] ## ## lat lon ## 1 38.89495 -77.03665 ## 2 33.67915 -118.02588 ## 3 47.60383 -122.33006

Or, reverse geocode:

# Reverse geocode Canadian embassies
# complete list of Canadian embassies here:
# http://open.canada.ca/data/en/dataset/6661f0f8-2fb2-46fa-9394-c033d581d531

embassies <- data.frame(
  lat=c("34.53311", "41.327546", "41.91534", "36.76148", "-13.83282",
        "40.479094", "-17.820705", "13.09511", "13.09511"),
  lon=c("69.1835", "19.818698", "12.50891", "3.0166", "-171.76462",
        "-3.686115", "31.043559", "-59.59998", "-59.59998"),
  stringsAsFactors=FALSE)

emb_coded_coords <- reverse_geocode_coords(embassies$lat, embassies$lon)

emb_coded_coords %>% select(display_name)

## Source: local data frame [9 x 1]
##
##                                                                  display_name
## 1                Embassy of Canada, Ch.R.Wazir Akbar Khan, Kabul, Afghanistan
## 2 Monumenti i Skënderbeut, Skanderbeg Square, Lulishtja Këshilli i Europëes,
## 3 Nomentana/Trieste, Via Nomentana, San Lorenzo, Salario, Municipio Roma II,
## 4 18, Avenue Khalef Mustapha, Ben Aknoun, Daïra Bouzareah, Algiers, Ben aknou
## 5                               The Hole in the Wall, Beach Road, Āpia, Samoa
## 6 Torre Espacio, 259 D, Paseo de la Castellana, Fuencarral, Fuencarral-El Par
## 7 Leopold Takawira Street, Avondale West, Harare, Harare Province, 00263, Zim
## 8                   Bishop's Court Hill, Bridgetown, Saint Michael, Barbados
## 9                   Bishop's Court Hill, Bridgetown, Saint Michael, Barbados

It can even return Spatial objects (somewhat experimental):

# stock example search from OSM
osm_search_spatial("[bakery]+berlin+wedding", limit=5)[[1]]

##            coordinates   place_id
## 1 (13.34931, 52.54165)    9039748
## 2 (13.34838, 52.54125) 2659941153
## 3 (13.35678, 52.55138)   23586341
## 4 (13.34985, 52.54158)    7161987
## 5  (13.35348, 52.5499)   29179742
##                                                                                licence
## 1 Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
## 2 Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
## 3 Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
## 4 Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
## 5 Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
##   osm_type     osm_id      lat      lon
## 1     node  939667448 52.54165 13.34931
## 2     node 3655549445 52.54125 13.34838
## 3     node 2299953786 52.55138 13.35678
## 4     node  762607353 52.54158 13.34985
## 5     node 2661679367 52.54990 13.35348
##                                                                                   display_name
## 1            Baguetterie, Föhrer Straße, Brüsseler Kiez, Wedding, Mitte, Berlin, 13353, Germany
## 2 Föhrer Cafe & Backshop, Föhrer Straße, Brüsseler Kiez, Wedding, Mitte, Berlin, 13353, Germany
## 3               Körfez, Amsterdamer Straße, Leopoldkiez, Wedding, Mitte, Berlin, 13347, Germany
## 4             Knusperbäcker, Torfstraße, Brüsseler Kiez, Wedding, Mitte, Berlin, 13353, Germany
## 5            Hofbäckerei, Müllerstraße, Brüsseler Kiez, Wedding, Mitte, Berlin, 13353, Germany
##   class   type importance
## 1  shop bakery      0.201
## 2  shop bakery      0.201
## 3  shop bakery      0.201
## 4  shop bakery      0.201
## 5  shop bakery      0.201
##                                                                                                      icon
## 1 http://mq-open-search-int-ls04.ihost.aol.com:8000/nominatim/v1/images/mapicons/shopping_bakery.p.20.png
## 2 http://mq-open-search-int-ls04.ihost.aol.com:8000/nominatim/v1/images/mapicons/shopping_bakery.p.20.png
## 3 http://mq-open-search-int-ls04.ihost.aol.com:8000/nominatim/v1/images/mapicons/shopping_bakery.p.20.png
## 4 http://mq-open-search-int-ls04.ihost.aol.com:8000/nominatim/v1/images/mapicons/shopping_bakery.p.20.png
## 5 http://mq-open-search-int-ls04.ihost.aol.com:8000/nominatim/v1/images/mapicons/shopping_bakery.p.20.png
##    bbox_left   bbox_top bbox_right bbox_bottom
## 1 52.5416504 52.5416504  13.349306   13.349306
## 2 52.5412496 52.5412496 13.3483832  13.3483832
## 3 52.5513806 52.5513806 13.3567785  13.3567785
## 4   52.54158   52.54158 13.3498507  13.3498507
## 5 52.5499029 52.5499029 13.3534756  13.3534756

The lookup functions are vectorized but there’s a delay built in to
avoid slamming the free servers.

Some things on the TODO list are:

  • enabling configuration of timeouts
  • enabling switching Nominatim API server providers (you can host your
    own!)
  • better Spatial support

So, give the code a spin and
submit feature requests/issues to github!


Computing AIC on a Validation Sample

Wed, 2015-07-29 07:29

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

This afternoon, in the data science training, we saw that it is possible to use the AIC criterion for model selection.

> library(splines)
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="log")))
[1] 438.6314
> AIC(glm(dist ~ speed, data=train_cars, family=poisson(link="identity")))
[1] 436.3997
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="log")))
[1] 425.6434
> AIC(glm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity")))
[1] 428.7195

And I've been asked why we don't use a training sample to fit a model and then a validation sample to compare the predictive properties of those models, penalizing by the complexity of the model. But it turns out that it is difficult to compute the AIC of those models on a different dataset. I mean, it is possible to write down the likelihood (since we have a Poisson model), but I want code that would work for any model, any distribution...
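
For the Poisson case specifically, a minimal sketch of that hand computation (using the train_cars and valid_cars samples created just below) could look like:

> reg = glm(dist ~ speed, data=train_cars, family=poisson(link="log"))
> lambda = predict(reg, newdata=valid_cars, type="response")
> # validation log-likelihood, penalized by the number of parameters fitted on the training data
> logLik_valid = sum(dpois(valid_cars$dist, lambda, log=TRUE))
> -2*logLik_valid + 2*length(coef(reg))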

Fortunately, Heather suggested a very clever idea, using her gnm package.

And actually, it works well.

We first create two datasets, one for training and one for validation,

> set.seed(1)
> idx = sample(1:50, size=10, replace=FALSE)
> train_cars = cars[-idx,]
> valid_cars = cars[idx,]

then use simply

> library(gnm)
> reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="log"))
> reg2 = gnm(dist ~ speed, data=valid_cars, constrain = "*",
+   constrainTo = coef(reg1), family=poisson(link="log"))

Here the Akaike criterion on the validation sample is

> AIC(reg2)
[1] 82.57612

Let us keep track of the predictions to plot them later on

> u = seq(2, 30, by=.1)   # prediction grid (assumed; not defined in the excerpt above)
> v_log = predict(reg1, newdata=data.frame(speed=u), type="response")

We can challenge that log-link Poisson model with a Poisson model with a linear (identity) link function

> reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="identity"))
> reg2 = gnm(dist ~ speed, data=valid_cars, constrain = "*",
+   constrainTo = coef(reg1), family=poisson(link="identity"))
> AIC(reg2)
[1] 73.24745
> v_id = predict(reg1, newdata=data.frame(speed=u), type="response")

We can also try to include splines, either with the log link

> library(splines)
> reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="log"))
> reg2 = gnm(dist ~ bs(speed), data=valid_cars, constrain = "*",
+   constrainTo = coef(reg1), family=poisson(link="log"))
> AIC(reg2)
[1] 82.57612
> v_log_bs = predict(reg1, newdata=data.frame(speed=u), type="response")

or with the identity link (but always in the same family)

> reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity"))
> reg2 = gnm(dist ~ bs(speed), data=valid_cars, constrain = "*",
+   constrainTo = coef(reg1), family=poisson(link="identity"))
> AIC(reg2)
[1] 73.24745
> v_id_bs = predict(reg1, newdata=data.frame(speed=u), type="response")

If we plot the predictions, we get

> plot(cars)
> points(train_cars, pch=19, cex=.85, col="grey")
> lines(u, v_log, col="blue")
> lines(u, v_id, col="red")
> lines(u, v_log_bs, col="blue", lty=2)
> lines(u, v_id_bs, col="red", lty=2)

Now, the problem with this holdout technique is that we might get unlucky (or lucky) when creating the samples. So why not try a small Monte Carlo study, where many samples are generated:

> four_aic = function(i){
+   idx = sample(1:50, size=10, replace=FALSE)
+   train_cars = cars[-idx,]
+   valid_cars = cars[idx,]
+   reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="log"))
+   reg2 = gnm(dist ~ speed, data=valid_cars, constrain = "*",
+     constrainTo = coef(reg1), family=poisson(link="log"))
+   a1 = AIC(reg2)
+   reg0 = lm(dist ~ speed, data=train_cars)
+   reg1 = gnm(dist ~ speed, data=train_cars, family=poisson(link="identity"), start=c(1,1))
+   reg2 = gnm(dist ~ speed, data=valid_cars, constrain = "*",
+     constrainTo = coef(reg1), family=poisson(link="identity"), start=c(1,1))
+   a2 = AIC(reg2)
+   reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="log"))
+   reg2 = gnm(dist ~ bs(speed), data=valid_cars, constrain = "*",
+     constrainTo = coef(reg1), family=poisson(link="log"))
+   a3 = AIC(reg2)
+   reg1 = gnm(dist ~ bs(speed), data=train_cars, family=poisson(link="identity"))
+   reg2 = gnm(dist ~ bs(speed), data=valid_cars, constrain = "*",
+     constrainTo = coef(reg1), family=poisson(link="identity"))
+   a4 = AIC(reg2)
+   return(c(a1, a2, a3, a4))}

Consider for instance 1,000 scenarios

> S = sapply(1:1000,four_aic)

The model that most often has the lowest AIC on the validation sample is the log link with splines

> rownames(S) = c("log","id","log+bs","id+bs") > W = apply(S,2,which.min) > barplot(table(W)/10,names.arg=rownames(S))

And indeed,

> boxplot(t(S))

with that model, the AIC is usually lower for the spline model with a log link than for the others (or at least almost the same as for the spline model with an identity link). Or at least, we can confirm here that a nonlinear model should be better than a linear one.


Mongolite 0.5: authentication and iterators

Tue, 2015-07-28 20:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

A new version of the mongolite package has appeared on CRAN. Mongolite builds on jsonlite to provide a simple, high-performance MongoDB client for R, which makes storing small or large data in a database as easy as converting it to/from JSON. Have a look at the vignette or useR2015 slides to get started with inserting, json queries, aggregation and map-reduce.
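
As a quick taste of how little code that takes, here is a minimal sketch, assuming a MongoDB server is running locally on the default port (the collection and database names are arbitrary):

library(mongolite)

# Connect to a local MongoDB instance (collection "mtcars" in database "test")
m <- mongo(collection = "mtcars", db = "test")

# Insert a data frame: each row becomes a JSON document
m$insert(mtcars)

# Query it back with a JSON query
m$find('{"cyl": 6}')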

Authentication and mongolabs

This release fixes an issue with the authentication mechanism that was reported by Dean Attali. The new version should properly authenticate to secured mongodb servers.

Try running the code below to grab some flights data from my mongolabs server:

# load the package
library(mongolite)
stopifnot(packageVersion("mongolite") >= "0.5")

# Connect to the 'flights' dataset
flights <- mongo("flights", url = "mongodb://readonly:test@ds043942.mongolab.com:43942/jeroen_test")

# Count data for query
flights$count('{"day":1,"month":1}')

# Get data for query
jan1_flights <- flights$find('{"day":1,"month":1}')

While debugging this, I found that mongolab is actually very cool. You can sign up for your own free (up to 500MB) mongodb server and easily create data collections with one or more read-only and/or read-write user accounts. This provides a pretty neat way to publish some data (read-only) or sync and collaborate with colleagues (read-write).

Iterators

Another feature request from some early adopters was to add support for iterators. Usually you want to use the mongo$find() method, which automatically converts data from a query into a data frame. However, sometimes you need finer control over the individual documents.

The new version adds a mongo$iterate() method to manually iterate over the individual records from a query without any automatic simplification. Using the same example query as above:

# Connect to the 'flights' dataset
flights <- mongo("flights", url = "mongodb://readonly:test@ds043942.mongolab.com:43942/jeroen_test")

# Create iterator
iter <- flights$iterate('{"day":1,"month":1}')

# Iterate over individual records
while(length(doc <- iter$one())){
  # do something with the row here
  print(doc)
}

Currently the iterator has 3 methods: one(), batch(n = 1000) and page(n = 1000). The iter$one method pops one document from the iterator (it would have been called iter$next() if that were not a reserved keyword in R). Both iter$batch(n) and iter$page(n) pop n documents at once. The difference is that iter$batch returns a list of at most length n, whereas iter$page returns a data frame with at most n rows.

Once the iterator is exhausted, its methods will only return NULL.
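
That NULL return value makes chunked processing straightforward; a small sketch re-using the read-only flights connection from above:

# Walk through the query results in chunks of 500 documents
iter <- flights$iterate('{"day":1,"month":1}')
n_docs <- 0
while(!is.null(chunk <- iter$batch(500))){
  # 'chunk' is a list of up to 500 documents
  n_docs <- n_docs + length(chunk)
}
n_docs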


I loved this %>% crosstable

Tue, 2015-07-28 18:12

(This article was first published on » R, and kindly contributed to R-bloggers)

This is a public thank you for @heatherturner's contribution. Now SciencesPo's crosstable can work in a chained (%>%) fashion, which is useful alongside other packages that have integrated the magrittr operator.

> candidatos %>%
+   filter(desc_cargo == 'DEPUTADO ESTADUAL' | desc_cargo == 'DEPUTADO DISTRITAL' |
+          desc_cargo == 'DEPUTADO FEDERAL' | desc_cargo == 'VEREADOR' |
+          desc_cargo == 'SENADOR') %>%
+   tab(desc_cargo, desc_sexo)
====================================================
                               desc_sexo
                     -------------------------
desc_cargo               NA  FEMININO  MASCULINO    Total
----------------------------------------------------
DEPUTADO DISTRITAL        1       826       2457     3284
                      0.03%       25%        75%     100%
DEPUTADO ESTADUAL       122     12595      48325    61042
                      0.20%       21%        79%     100%
DEPUTADO FEDERAL         40      5006      20176    25222
                      0.16%       20%        80%     100%
SENADOR                   4       161       1002     1167
                      0.34%       14%        86%     100%
VEREADOR               9682    376576    1162973  1549231
                      0.62%       24%        75%     100%
----------------------------------------------------
Total                  9849    395164    1234933  1639946
                      0.60%       24%        75%     100%
====================================================
Chi-Square Test for Independence
Number of cases in table: 1639946
Number of factors: 2
Test for independence of all factors:
  Chisq = 1077.4, df = 8, p-value = 2.956e-227
                     X^2  df  P(> X^2)
Likelihood Ratio  1216.0   8         0
Pearson           1077.4   8         0
Phi-Coefficient   : 0.026
Contingency Coeff.: 0.026
Cramer's V        : 0.018

# Reproducible example:

library(SciencesPo)
gender = rep(c("female","male"), c(1835, 2691))
admitted = rep(c("yes","no","yes","no"), c(557, 1278, 1198, 1493))
dept = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
           c(89,17,202,131,94,24,19,8,391,244,299,317))
dept2 = rep(c("A","B","C","D","E","F","A","B","C","D","E","F"),
            c(512,353,120,138,53,22,313,207,205,279,138,351))
department = c(dept, dept2)
ucb = data.frame(gender, admitted, department)

> ucb %>% tab(admitted, gender, department)
================================================================
                              department
                  -----------------------------------------
admitted  gender       A      B      C      D      E      F   Total
----------------------------------------------------------------
no        female      19      8    391    244    299    317    1278
                    1.5%  0.63%    31%    19%  23.4%    25%    100%
          male       313    207    205    279    138    351    1493
                   21.0% 13.86%    14%    19%   9.2%    24%    100%
          -------------------------------------------------------
          Total      332    215    596    523    437    668    2771
                   12.0%  7.76%    22%    19%  15.8%    24%    100%
----------------------------------------------------------------
yes       female      89     17    202    131     94     24     557
                     16%   3.1%    36%    24%  16.9%   4.3%    100%
          male       512    353    120    138     53     22    1198
                     43%  29.5%    10%    12%   4.4%   1.8%    100%
          -------------------------------------------------------
          Total      601    370    322    269    147     46    1755
                     34%  21.1%    18%    15%   8.4%   2.6%    100%
----------------------------------------------------------------
Total     female     108     25    593    375    393    341    1835
                    5.9%   1.4%    32%    20%  21.4%    19%    100%
          male       825    560    325    417    191    373    2691
                   30.7%  20.8%    12%    15%   7.1%    14%    100%
          -------------------------------------------------------
          Total      933    585    918    792    584    714    4526
                   20.6%  12.9%    20%    17%  12.9%    16%    100%
================================================================


Pluto: To Catch an Icy King

Tue, 2015-07-28 16:06

(This article was first published on Graph of the Week, and kindly contributed to R-bloggers)

Sly as a fox, it is. Mysterious and diminutive, it has eluded us for decades. Despite what we’ve learned about Pluto, constant debate continues to rage over its classification. From the moment it was discovered, astronomers have bickered over this icy body and its place in our solar system. Was it Planet X? Is it a planet at all? Did it really ‘have it coming‘? We’ve all longed to know more about this categorization-resistant body which has stirred up so much controversy in news and astronomy circles alike. How did we get so riled up about an icy rock so far distant? To understand that, we must start at the beginning.



Planet X
Before there was Pluto, there was Planet X.
Allow me to set the scene for you: It is the mid-1800s in Europe and North America. People are migrating to cities en masse, lured by the economics of the Industrial Revolution. As the number of mechanical monstrosities increase, so too does the pace of scientific discovery. Charles Darwin has just published The Origin of Species (original full title and certified mouthful: “On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life“) which inflames the science-vs-religion debate. The planet Neptune is discovered. This, coupled with Uranus’ prior discovery in the late 1700s raises the possibility that more, undiscovered worlds exist in our solar system.
This article was written by Patrick Rhodes and published on June 4, 2015. Click here to read the rest of the article.


Goals for the New R Consortium

Tue, 2015-07-28 14:56

(This article was first published on r4stats.com » R, and kindly contributed to R-bloggers)

by Bob Muenchen

The recently-created R Consortium consists of companies that are deeply involved in R such as RStudio, Microsoft/Revolution Analytics, Tibco, and others. The Consortium’s goals include advancing R’s worldwide promotion and support, encouraging user adoption, and improving documentation and tools. Those are admirable goals and below I suggest a few specific examples that the consortium might consider tackling.

As I work with various organizations to help them consider migrating to R, common concerns are often raised. With thousands of packages to choose from, where do I start? Do packages go through any reliability testing? What if I start using a package and its developer abandons it?  These, and others, are valid concerns that the R Consortium could address.

Choosing Packages

New R users face a daunting selection of thousands of packages. Some guidance is provided by CRAN’s Task Views. In R’s early years, this area was quite helpful in narrowing down a package search. However, R’s success has decreased the usefulness of Task Views. For example, say a professor asks a grad student to look into doing a cluster analysis. In SAS, she’ll have to choose among seven procedures. When considering the Task View on the subject, she’ll be presented with 105 choices in six categories!  The greater selection is one of R’s strengths, but to encourage the adoption of R by a wider community it would be helpful to list the popularity of each package. The more popular packages are likely to be the most useful.

R functions are integrated into other software such as Alteryx, IBM SPSS Statistics, KNIME, and RapidMiner. Some are also called from R user interfaces such as Deducer, R Commander, and RATTLE. Within R, some packages depend on others, adding another vote of confidence. The R Consortium could help R users by documenting these various measures of popularity, perhaps creating an overall composite score.
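
Download counts are one ingredient of such a composite score that is already easy to obtain; a minimal sketch using the cranlogs package (one possible data source among several):

library(cranlogs)

# Compare last month's CRAN downloads for a few clustering packages
downloads <- cran_downloads(packages = c("cluster", "mclust", "fpc", "dbscan"),
                            when = "last-month")

# Total downloads per package, as one rough popularity signal
aggregate(count ~ package, data = downloads, FUN = sum)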

Accuracy

People often ask how they can trust the accuracy (or reliability) of software that is written by a loosely knit group of volunteers, when there have even been notable lapses in the accuracy of commercial software developed by corporate teams [1]. Base R and its “recommended packages” are very well tested, and the details of the procedures are documented in The R Software Development Life Cycle. That set of software is substantial, the equivalent of Base SAS + GRAPH + STAT + ETS + IML + Enterprise Miner (excluding GUIs, Structural Equation Modeling, and Multiple Imputation, which are in add-on packages). Compared to SPSS, it’s the rough equivalent to IBM SPSS Base + Statistics + Advanced Stat. + Regression + Forecasting + Decision Trees + Neural Networks + Bootstrapping.

While that set is very capable, it still leaves one wondering about all the add-on packages. Performing accuracy tests is very time consuming work [2-5] and even changing the options on the same routine can affect accuracy [6].  Increasing the confidence that potential users have in R’s accuracy would help to increase the use of the software, one of the Consortium’s goals. So I suggest that they consider ways to increase the reliability testing of functions that are outside the main R packages.

Given the vast number of R packages available, it would be impossible for the Consortium to perform such testing on all packages. However, for widely used packages, it might behoove the Consortium to use its resources to develop such tests themselves. A web page that referenced Consortium testing, as well as testing from any source, would be helpful.
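
To give a flavour of what even the smallest such reliability test looks like, here is a sketch that checks a base R routine against a closed-form result (the data are a toy vector chosen purely for illustration):

# For an intercept-only Poisson GLM with log link, the maximum-likelihood
# estimate of the mean must equal the sample mean.
y <- c(2, 5, 3, 8, 0, 4, 6)                 # toy data for illustration
fit <- glm(y ~ 1, family = poisson(link = "log"))
stopifnot(all.equal(unname(exp(coef(fit))), mean(y), tolerance = 1e-6))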

Package Longevity

If enough of a package’s developers got bored and moved on or, more dramatically, were hit by the proverbial bus, development would halt. Base R plus recommended packages has the whole R Development Core Team backing them up. Other packages are written by employees of companies. In such cases, it is unclear whether the packages are supported by the company or by the individual developer(s).

Using the citation function will list a package’s developers. The more there are, the better chance there is of someone taking over if the lead developer moves on. The Consortium could develop a rating system that would provide guidance along these lines. Nothing lasts forever, but knowing the support level a package has would be of great help when choosing which to use.
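
For any individual package that check is a one-liner; for example (ggplot2 is just an arbitrary choice here):

# Who is credited for the package, and how many people are listed?
citation("ggplot2")

# The Author field from the package metadata gives a similar picture
packageDescription("ggplot2")$Author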

Encourage Support and Use of Key Generic Functions

Some fairly new generic functions play a key role in making R easier to use. For example, David Robinson’s broom package contains functions that translate the output of modeling functions from list form into data frames, making output management much easier. Other packages, including David Dahl’s xtable and Philip Leifeld’s texreg, do a similar translation to present the output in nicely formatted forms for publishing. Those developers have made major contributions to R by writing all the methods themselves. The R Consortium could develop a list of such functions and encourage other developers to add methods to them, when appropriate. Such widely applicable functions could also benefit from having the R Consortium support their development, assuring longer package longevity and wider use.
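
A two-line example shows why such functions are so widely applicable (a sketch using the built-in cars data):

library(broom)

# Turn the usual list-like model object into a tidy data frame of coefficients
fit <- lm(dist ~ speed, data = cars)
tidy(fit)

# glance() gives one-row model-level summaries in the same spirit
glance(fit)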

Output to Microsoft Word

R has the ability to create beautiful output in almost any format you would like, but it takes additional work.  Its competition, notably SAS and SPSS, let you choose the font and full formatting of your output tables at installation. From then on, any time you want to save output to a word processor, it’s a simple cut & paste operation. SPSS even formats R output to fully formatted tables, unlike any current R IDEs. Perhaps the R Consortium could pool the resources needed to develop this kind of output. If so, it would be a key aspect of their goal of speeding R’s adoption. (I do appreciate the greater power of LaTeX and the ease of use of knitr and Rmarkdown, but they’ll never match the widespread use of Word.)
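
As things stand, the closest route is rendering an R Markdown document straight to Word; a minimal sketch (report.Rmd is a hypothetical file name):

library(rmarkdown)

# Knit an R Markdown file directly to a .docx file
render("report.Rmd", output_format = "word_document")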

Graphical User Interface

Programming offers the greatest control over an analysis, but many researchers don’t analyze data often enough to become good programmers; many simply don’t like programming. Graphical User Interfaces (GUIs) help such people get their work done more easily. The traditional menu-based systems, such as R Commander or Deducer, make one-time work easy, but they don’t offer a way to do repetitive projects without relying on the code that non-programmers wish to avoid.

Workflow-based GUIs are also easy to use and, more importantly, they save all the steps as a flowchart. This allows you to check your work and repeat it on another data set simply by updating the data import node(s) and clicking “execute.” To take advantage of this approach, Microsoft’s Revolution R Enterprise integrates into Alteryx and KNIME, and Tibco’s Enterprise Runtime for R integrates into KNIME as well. Alteryx is a commercial package, and KNIME is free and open source on the desktop. While both have commercial partners, each can work with the standard community version of R as well.

Both packages contain many R functions that you can control with a dialog box. Both also allow R programmers to add a programming node in the middle of the workflow.  Those nodes can be shared, enabling an organization to get the most out of both their programming and non-programming analysts. Both systems need to add more R nodes to be considered general-purpose R GUIs, but they’re making fairly rapid progress on that front. In each system, it takes less than an hour to add a node to control a typical R function.

The R Consortium could develop a list of recommended steps for developers to consider. One of these steps could be adding nodes to such GUIs. Given the open source nature of R, encouraging the use of the open source version of KNIME would make the most sense. That would not just speed the adoption of R, it would enable its adoption by the large proportion of analysts who prefer not to program. For the more popular packages, the Consortium could consider using their own resources to write such nodes.

Conclusion

The creation of the R Consortium offers an intriguing opportunity to expand the use of R around the world. I’ve suggested several potential goals for the Consortium, including ways to help people choose packages, increase reliability testing, rating package support levels, increasing visibility of key generic functions, adding support for Word, and making R more accessible through stronger GUI support. What else should the R Consortium consider?  Let’s hear your ideas in the comments section below.

Is your organization still learning R?  I’d be happy to stop by and help. I also have a workshop, R for SAS, SPSS and Stata Users, on DataCamp.com. If you found this post useful, I invite you to follow me on Twitter.

Acknowledgements

Thanks to Drew Schmidt and Michael Berthold for their suggestions that improved this post.

References

  1. Micah Altman (2002), A Review of JMP 4.03 With Special Attention to its Numerical Accuracy, The American Statistician, 56:1, 72-75, DOI: 10.1198/000313002753631402
  2. B. D. McCullough (1998), Assessing the Reliability of Statistical Software: Part I, The American Statistician, 52:4, 358-366
  3. B. D. McCullough (1999), Assessing the Reliability of Statistical Software: Part II, The American Statistician, 53:2, 149-159
  4. Kellie B. Keeling, Robert J. Pavur (2007), A comparative study of the reliability of nine statistical software packages, Computational Statistics & Data Analysis, Vol. 51, Issue 8, pp. 3811-3831
  5. Oluwartotimi O. Odeh, Allen M. Featherstone and Jason S. Bergtold (2010), Reliability of Statistical Software, American Journal of Agricultural Economics, doi: 10.1093/ajae/aaq068
  6. Jason S. Bergtold, Krishna Pokharel and Allen Featherstone (2015), Selected Paper prepared for presentation at the 2015 Agricultural & Applied Economics Association and Western Agricultural Economics Association Annual Meeting, San Francisco, CA, July 26-28
