# R bloggers

### Predicting the Six Nations

(This article was first published on **Mango Solutions**, and kindly contributed to R-bloggers)

By Douglas Ashton – Consultant, UK

I don’t know a lot about rugby, which can be a problem living in a rugby town. Especially when the office sweepstake on the upcoming Wales vs England Six Nations game goes round: apparently 2-1 is not a valid rugby score. I’m not about to put a pound down without some research. Fortunately, England and Wales have played each other before:

wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales

and R has some nice tools for grabbing data from the web.

Scraping and cleaning

To get the data in a usable form, the rvest package has some really useful tools. With a few lines of code we can pull the HTML data into a data frame.

Some grepping and date parsing later we have a cleaned up dataset.
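The scraping code isn’t shown in this copy of the post; a minimal sketch with rvest might look like this (the URL is the page linked above; the table index is an assumption you’d check interactively):

```r
# Sketch: scrape the match-results table from the Wikipedia page with rvest.
library(rvest)

url <- "https://en.wikipedia.org/wiki/History_of_rugby_union_matches_between_England_and_Wales"
page <- read_html(url)

# html_table() parses every <table> on the page into a list of data frames;
# the results table is assumed to be one of them -- inspect the list to find it.
tables <- html_table(page, fill = TRUE)
results <- tables[[1]]  # hypothetical index
head(results)
```

From here the grepping and date parsing mentioned above would tidy the columns into the cleaned-up dataset.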

| Date | Home | England | Wales |
|------------|---------|---------|-------|
| 2014-03-09 | England | 29 | 18 |
| 2013-03-16 | Wales | 3 | 30 |
| 2012-02-25 | England | 12 | 19 |
| 2011-08-13 | Wales | 9 | 19 |
| 2011-08-06 | England | 23 | 19 |
| 2011-02-04 | Wales | 26 | 19 |

A quick look at the data suggests things have been pretty even over the years, with some big wins for England around the turn of the Millennium and Wales dominant in the 60s and 70s.

Who’s going to win?

If we just look at who has won previous encounters, we see that Wales have a slight edge, but nothing statistically significant.

| Wales Wins | England Wins | Draws |
|------------|--------------|-------|
| 56 | 52 | 12 |

How about if we take into account home and away form? The game on the 6th will be in Cardiff; will that give the edge to Wales?

```
            Estimate Std. Error z value Pr(>|z|)
homeEngland   -0.769      0.278  -2.771    0.006
homeOther      0.000      1.414   0.000    1.000
homeWales      0.492      0.271   1.820    0.069
```
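The modelling code isn’t shown in the post; a hypothetical sketch of how coefficients like these could be produced with a logistic regression (toy data and invented column names, not the author’s actual code):

```r
# Toy stand-in for the cleaned match data: a win indicator for Wales and a
# factor recording where the match was played (all values invented).
matches <- data.frame(
  home      = factor(c("England", "England", "Wales", "Wales", "Other")),
  wales_win = c(0, 0, 1, 1, 1)
)

# Model the probability of a Wales win as a function of venue:
fit <- glm(wales_win ~ home, data = matches, family = binomial)
summary(fit)$coefficients

# Predicted probability of a Wales win for a game played in Wales:
predict(fit, newdata = data.frame(home = "Wales"), type = "response")
```

With the real scraped data, a prediction like this is where the 62% figure below would come from.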

This suggests the chances of a Wales win are 62%, so I’d say yes. It’s not a powerful prediction, but Wales tend to win in Wales. Good enough for me; I’ll go with Wales. OK, so what’s the damage going to be?

What’s the score?

The sweepstake requires scores. This is the bit I really have no idea about: for a football fan used to scores such as 2-0, rugby scores seem arbitrarily large. Back to the data, I guess. First up, what’s the total?

Interestingly, it looks like the total number of points has been going up since the 50s. At this point I’m desperate, so let’s predict the total score by fitting a trend since the 50s and extrapolating. When Wales win there tend to be fewer points, so let’s throw that into the model as well; it will screen out those silly big English wins at the Millennium.
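The fitting code isn’t shown either; a hypothetical sketch of the kind of model described (invented toy data standing in for the scraped results):

```r
# Toy stand-in for the scraped data: total points per match, year, and a
# Wales-win indicator (all values invented for illustration).
matches <- data.frame(
  year      = c(1955, 1965, 1975, 1985, 1995, 2005, 2014),
  total     = c(11, 14, 24, 25, 32, 36, 47),
  wales_win = c(1, 1, 1, 0, 0, 0, 1)
)

# Linear trend since the 50s, with a Wales-win term to absorb the
# lower-scoring Wales victories:
fit <- lm(total ~ year + wales_win, data = matches, subset = year >= 1950)

# Extrapolate to the upcoming fixture, assuming a Wales win:
predict(fit, newdata = data.frame(year = 2015, wales_win = 1))
```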

Which predicts a total score on Friday of 39 and a difference of 9, giving my final prediction as

**Wales 24 – 15 England**

That’ll do for a pound I think.

To **leave a comment** for the author, please follow the link and comment on his blog: **Mango Solutions**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

### Rmetrics Summer Workshop 2015

(This article was first published on **Rmetrics blogs**, and kindly contributed to R-bloggers)


### Photos of the 6th MilanoR meeting

(This article was first published on **MilanoR**, and kindly contributed to R-bloggers)


Milano; December 18, 2014


### Visualizing Home Ownership With Small Multiples And R

(This article was first published on **Ripples**, and kindly contributed to R-bloggers)

If everybody had an ocean, across the U.S.A., then everybody’d be surfin’ like California (Beach Boys, Surfin’ U.S.A.)

I was invited to write a post for Domino Data Lab, a company based in California that provides a cloud-based machine-learning platform, enabling companies to use the power of the cloud to build analytical projects. I also recently discovered this book, which supports the premise of companies like Domino Data Lab that are leading the change in the way data science is done. How I wish, in future, to be able to forget expressions like *execution time*, *update versions* and *memory limit*!

Since I really like small multiples, I decided to plot the evolution of home ownership across the United States (the more I use the gridExtra package, the more I like it). You can read the post here (code included).

By the way, if you want to go to Gigaom Structure Data 2015 for free, Domino Data Lab is giving away 2 tickets here.


### Financial Charts | Pan and Zoom

(This article was first published on **Timely Portfolio**, and kindly contributed to R-bloggers)


### How to Get the Frequency Table of a Categorical Variable as a Data Frame in R

(This article was first published on **The Chemical Statistician » R programming**, and kindly contributed to R-bloggers)

One feature that I like about R is the ability to access and manipulate the outputs of many functions. For example, you can extract the kernel density estimates from density() and scale them to ensure that the resulting density integrates to 1 over its support set.
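For example, a sketch of that density manipulation using base R only (the trapezoidal rule approximates the integral; note that density() output already integrates very close to 1, so this mostly demonstrates accessing and modifying the returned components):

```r
# Extract the kernel density estimate for a variable:
d <- density(mtcars$mpg)

# Trapezoidal-rule approximation of the area under the estimated density:
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Rescale the density values so the area is exactly 1 over the support set:
d$y <- d$y / area
```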

I recently needed to get a **frequency table** of a **categorical variable** in R, and I wanted the output as a **data frame** that I could access and manipulate. This is a fairly simple and common task in statistics and data analysis, so I thought that there must be a function in base R that could easily generate it. Sadly, I could not find such a function. In this post, I will explain why the seemingly obvious table() function does not work well, and I will demonstrate how the count() function in the ‘plyr’ package can achieve this goal.

The Example Data Set – mtcars

Let’s use the mtcars data set that is built into R as an example. The categorical variable that I want to explore is “**gear**” – this denotes the number of forward gears in the car – so let’s view the first 6 observations of just the car model and the gear. We can use the subset() function to restrict the data set to show just the row names and “gear”.
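That step, in code (a sketch; the original snippet isn’t reproduced in this copy of the post):

```r
# First 6 observations of just the "gear" column; the car model appears as
# the row name, so no extra column is needed.
head(subset(mtcars, select = "gear"))
```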

What are the possible values of “gear”? Let’s use the factor() function to find out.

```r
> factor(mtcars$gear)
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4
Levels: 3 4 5
```

The cars in this data set have either 3, 4 or 5 forward gears. **How many cars are there for each number of forward gears?**

Why the table() function does not work well

The table() function in Base R does give the counts of a categorical variable, but the output is not a data frame – it’s a table, and it’s not easily accessible like a data frame.

```r
> w = table(mtcars$gear)
> w

 3  4  5 
15 12  5 
> class(w)
[1] "table"
```

You can convert this to a data frame, but the result does not retain the variable name “gear” in the corresponding column name.

```r
> t = as.data.frame(w)
> t
  Var1 Freq
1    3   15
2    4   12
3    5    5
```

You can correct this problem with the names() function.

```r
> names(t)[1] = 'gear'
> t
  gear Freq
1    3   15
2    4   12
3    5    5
```

I finally have what I want, but that took several functions to accomplish. Is there an easier way?

count() to the Rescue! (With Complements to the “plyr” Package)

Thankfully, there is an easier way – it’s the count() function in the “plyr” package. If you don’t already have the “plyr” package, install it first – run the command

```r
install.packages('plyr')
```

Then, call its library, and the count() function will be ready for use.

```r
> library(plyr)
> count(mtcars, 'gear')
  gear freq
1    3   15
2    4   12
3    5    5
> y = count(mtcars, 'gear')
> y
  gear freq
1    3   15
2    4   12
3    5    5
> class(y)
[1] "data.frame"
```

As the class() function confirms, this output is indeed a data frame!

Filed under: Applied Statistics, Categorical Data Analysis, Data Analysis, Descriptive Statistics, R programming, Statistics, Tutorials Tagged: categorical variable, class(), count(), data frame, factor(), frequency table, install.packages(), mtcars, names(), plyr, R, R programming, subset(), table()


### Tutorial on High-Performance Computing in R

(This article was first published on **Mad (Data) Scientist**, and kindly contributed to R-bloggers)

I wanted to call your attention to what promises to be an outstanding tutorial on High-Performance Computing (HPC) in R, presented in Web streaming format. My Rth package coauthor Drew Schmidt, who is also one of the authors of the pbdR package, will be one of the presenters. It should be very interesting and useful.


### Base R Assessment!

(This article was first published on **Econometrics by Simulation**, and kindly contributed to R-bloggers)

Built using Concerto, the R-powered adaptive testing platform, this assessment provides a short but powerful tool for evaluating your base R understanding relative to that of your peers.

Currently the assessment has over seventy items (questions) in the pool, while each individual takes fewer than twenty of these, selected randomly. So each test is unique.

Those who score well enough will be given the chance to contribute their own items to challenge other users.

http://concerto4.e-psychometrics.com/?wid=13&tid=14


### Sharing Your Shiny Apps

(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

by Siddarth Ramesh

R Programmer, Revolution Analytics

A couple of months ago, I worked on a customer engagement involving Shiny. Shiny is a package created by RStudio that is intended to make plots dynamic and interactive. One advantage of Shiny is that an app can be written entirely in R, without needing to know any other programming languages. Shiny is a powerful tool for creating an interactive interface for applications such as predictive analysis. In this post I’ll give a high-level view of Shiny and delve into one of its most useful features – the ease of sharing your app with the world.

For those of you who are not familiar with Shiny, I’ll briefly describe the high-level architecture. A Shiny application consists of two R files – a server file and a user interface (UI) file. The UI file acts as an HTML interpreter – you can create your buttons, checkboxes, images, and other HTML widgets from here. The server file is where Shiny’s real magic happens. This is where you can make the buttons and checkboxes that you created in your UI actually do something. In other words, it is where R users can turn their app, using only R, into a dynamic visual masterpiece. If you type runApp("appName") in your console window – with the Shiny app folder as your working directory, of course – then you can see your output.

The following is a small Shiny App I made. Note that both the UI and the server.R files work in conjunction and must be in the same directory to create the final output:

UI Code

Server Code
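The UI and server code screenshots don’t survive in this copy of the post; as a stand-in, here is a minimal sketch of what such a pair might look like (illustrative only, not the author’s actual app; it uses the single-file shinyApp() form, whereas the article’s app used separate ui.R and server.R files holding the same two components):

```r
library(shiny)

# UI: declares the HTML widgets (a slider and a plot area).
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: reacts to the UI inputs and renders the plot.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Old Faithful waiting times", xlab = "Minutes")
  })
}

shinyApp(ui, server)  # launches the app when run interactively
```

Saved as app.R, running the script (or calling runApp() on its folder) launches the app locally, and moving the slider redraws the histogram in front of you.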

With Shiny, you can completely customize your specifications for a plot and the plot will change right in front of you.

One of Shiny’s benefits lies in the ease with which Shiny apps can be shared. Originally, I thought that in order to host my Shiny app on the web, I would need to somehow procure a dedicated server and acquire a URL before I could even think about sharing it with the world. That is a complicated process: I had a server and was still figuring out how to get the URL when I discovered shinyapps.io.

Shinyapps.io is a platform which allows you to share your Shiny applications online. In order to use Shinyapps.io, you would first have to install it with:

```r
devtools::install_github('rstudio/shinyapps')
```

Depending on the type of operating system you have, Shinyapps.io has a few dependencies that need to be installed. Since I am a Windows user, I needed RTools and the devtools R package. Linux users need GCC, and Mac users need XCode Command Line Tools.

Instead of running the runApp("ShinyAppName") command which opens up your Shiny app locally on your machine, you would instead run deployApp("ShinyAppName"). The command “deployApp()” automatically syncs up to the Shinyapps.io server, and opens up the shinyapps.io website on your browser. If you have never used shinyapps.io, you must first set up an account. Creating an account is simple to do, and once you have a shinyapps.io account, your application from RStudio will become an instance on the shinyapps.io server with its own URL. At this point, your application is on the internet and can be viewed by anybody with access to it.

There are some advantages to using Shinyapps.io. One is that it negates the need for your own server, as you would be hosting your app on the Shinyapps.io virtualized server. This saves you some money, and you would not have to worry about maintaining the server. If you are worried about security, you do not have to be: each Shiny app runs in a protected environment, each application is SSL encrypted, and user authentication is offered. R users who use R Markdown may notice that the process of uploading a Shiny app to shinyapps.io is fairly similar to uploading a Markdown file to RPubs.

The following link is the output of the simple Shiny application I created earlier, hosted as an instance on shinyapps.io:

Shiny is constantly being improved upon, and as aesthetically pleasing and smooth as it is right now, it is only getting more elegant as time goes by. If you are interested in exploring the power and diversity of Shiny, check out this link below!


### A data.table R tutorial by DataCamp: intro to DT[i, j, by]

(This article was first published on **DataCamp Blog » R**, and kindly contributed to R-bloggers)

This data.table R tutorial explains the basics of the DT[i, j, by] command, which is the core of the data.table package. If you want to learn more about the data.table package, DataCamp provides an interactive R course on it. The course has more than 35 interactive R exercises – all taking place in the comfort of your own browser – and several videos with Matt Dowle, main author of the data.table package, and Arun Srinivasan, a major contributor. Try it for free.

If you have already worked with large datasets in RAM (1 to more than 100 GB), you know that a data.frame can be limiting: the time it takes to do certain things is just too long. data.table solves this for you by reducing computing time. What’s more, it also makes it easier to do more with less typing. Once you master the data.table syntax from this tutorial, the simplicity of doing complicated operations will astonish you. So you will not only be reducing computing time, but programming time as well.

The DT[i,j,by] command has three parts: i, j and by. If you think in SQL terminology, the i corresponds to WHERE, j to SELECT and by to GROUP BY. We talk about the command by saying “Take DT, subset the rows using ‘i’, then calculate ‘j’ grouped by ‘by’”. So in a simple example and using the hflights dataset (so you can reproduce all the examples) this gives:

```r
library(hflights)
library(data.table)
DT <- as.data.table(hflights)
DT[Month==10, mean(na.omit(AirTime)), by=UniqueCarrier]
   UniqueCarrier        V1
1:            AA  68.76471
2:            AS 255.29032
3:            B6 176.93548
4:            CO 141.52861
...
```

Here we subsetted the data table to keep only the rows of the 10th month of the year, calculated the average AirTime of the planes that actually flew (na.omit() is used because cancelled flights don’t have a value for their AirTime), and then grouped the results by their carrier. We can see, for example, that AA (American Airlines) has a very short average AirTime compared to AS (Alaska Airlines). Did you also notice that base R functions can be used in the j part? We will get to that later.

**The i part**

The ‘i’ part is used for subsetting on rows, just like in a data frame.

```r
DT[2:5]  # selects the second to the fifth row of DT
   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime
1: 2011     1          2         7    1401    1501            AA       428  N557AA                60      45
2: 2011     1          3         1    1352    1502            AA       428  N541AA                70      48
3: 2011     1          4         2    1403    1513            AA       428  N403AA                70      39
4: 2011     1          5         3    1405    1507            AA       428  N492AA                62      44
   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
1:       -9        1    IAH  DFW      224      6       9         0                         0
2:       -8       -8    IAH  DFW      224      5      17         0                         0
3:        3        3    IAH  DFW      224      9      22         0                         0
4:       -3        5    IAH  DFW      224      9       9         0                         0
```

But you can also use column names, as they are evaluated in the scope of DT.

```r
DT[UniqueCarrier=="AA"]  # returns all rows where the carrier is American Airlines
      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
   1: 2011     1          1         6    1400    1500            AA       428  N576AA                60
   2: 2011     1          2         7    1401    1501            AA       428  N557AA                60
   3: 2011     1          3         1    1352    1502            AA       428  N541AA                70
   4: 2011     1          4         2    1403    1513            AA       428  N403AA                70
   5: 2011     1          5         3    1405    1507            AA       428  N492AA                62
  ---
3240: 2011    12         27         2    1021    1333            AA      2234  N3ETAA               132
3241: 2011    12         28         3    1015    1329            AA      2234  N3FJAA               134
3242: 2011    12         29         4    1023    1335            AA      2234  N3GSAA               132
3243: 2011    12         30         5    1024    1334            AA      2234  N3BAAA               130
3244: 2011    12         31         6    1024    1343            AA      2234  N3HNAA               139
      AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
   1:      40      -10        0    IAH  DFW      224      7      13         0                         0
   2:      45       -9        1    IAH  DFW      224      6       9         0                         0
   3:      48       -8       -8    IAH  DFW      224      5      17         0                         0
   4:      39        3        3    IAH  DFW      224      9      22         0                         0
   5:      44       -3        5    IAH  DFW      224      9       9         0                         0
  ---
3240:     112      -12        1    IAH  MIA      964      8      12         0                         0
3241:     112      -16       -5    IAH  MIA      964      9      13         0                         0
3242:     110      -10        3    IAH  MIA      964     12      10         0                         0
3243:     110      -11        4    IAH  MIA      964      9      11         0                         0
3244:     119       -2        4    IAH  MIA      964      8      12         0                         0
```

Notice that you don’t have to use a comma for subsetting rows in a data table. In a data.frame, DF[2:5] would give all the rows of the 2nd to 5th columns. Instead (as everyone reading this obviously knows), we have to specify DF[2:5, ]. Also notice that DT[, 2:5] does not mean anything for data tables, as is explained in the first question of the FAQs of the data.table package.

*Quirky and useful*: when subsetting rows you can also use the symbol .N in the DT[…] command, which is the number of rows or the last row. You can use it for selecting the last row or an offset from it.
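For example (a sketch using mtcars so the snippet stands alone; the idea is the same for any data table):

```r
library(data.table)

DT <- as.data.table(mtcars)

# .N evaluates to the number of rows, so it can index the last row
# or an offset from it:
DT[.N]      # the last row
DT[.N - 1]  # the second-to-last row
```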

**The j part**

The ‘j’ part is used to select columns and do *stuff* with them. And *stuff* can really mean anything. All kinds of functions can be used, which is a strong point of the data.table package.

Notice that the ‘i’ part is left blank, and the first thing in the brackets is a comma. This might seem counterintuitive at first. However, this simply means that we do not subset on any rows, so all rows are selected. In the ‘j’ part, the average arrival delay of all flights is calculated. It appears that the average flight in the hflights dataset arrived with more than 7 minutes of delay. Be prepared when catching your next flight!
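The command this paragraph refers to is missing from this copy of the post; it was presumably along these lines:

```r
library(hflights)
library(data.table)

DT <- as.data.table(hflights)

# Blank 'i' part (note the leading comma): no row subsetting, so the mean
# arrival delay is computed over all flights.
DT[, mean(na.omit(ArrDelay))]
```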

When selecting several columns and doing *stuff* with them in the ‘j’ part, you need to use the ‘.()’ notation. This notation is actually just an alias for ‘list()’. It returns a data table, whereas not using ‘.()’ returns only a vector, as shown above.
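A sketch of that contrast (again building DT from hflights; the exact snippet is missing from this copy of the post):

```r
library(hflights)
library(data.table)

DT <- as.data.table(hflights)

# Without .(): the result is a plain numeric vector.
DT[, mean(na.omit(ArrDelay))]

# With .(): the same calculation returns a one-row data table.
DT[, .(mean(na.omit(ArrDelay)))]
```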

Another useful feature which requires the ‘.()’ notation allows you to rename columns inside the DT[…] command.

```r
DT[, .(Avg_ArrDelay = mean(na.omit(ArrDelay)))]
   Avg_ArrDelay
1:     7.094334
DT[, .(Avg_DepDelay = mean(na.omit(DepDelay)), Avg_ArrDelay = mean(na.omit(ArrDelay)))]
   Avg_DepDelay Avg_ArrDelay
1:     9.444951     7.094334
```

Of course, new column names are not obligatory.

Combining the above about ‘i’ and ‘j’ gives:

```r
DT[UniqueCarrier=="AA",
   .(Avg_DepDelay = mean(na.omit(DepDelay)),
     Avg_ArrDelay = mean(na.omit(ArrDelay)),
     plot(DepTime, DepDelay, ylim=c(-15,200)),
     abline(h=0))]
   Avg_DepDelay Avg_ArrDelay   V3   V4
1:     6.390144    0.8917558 NULL NULL
```

Here we took DT, selected all rows where the carrier was AA in the ‘i’ part, calculated the average delay on departure and on arrival, and plotted the time of departure against the delay on departure in the ‘j’ part.

To recap, the ‘j’ part is used to do calculations on columns specified in that part. As the columns of a data table are seen as variables, and the parts of ‘j’ are evaluated as expressions, virtually anything can be done in the ‘j’ part. This significantly shortens your programming time.

**The by part**

The final section of this data.table R tutorial focuses on the ‘by’ part. The ‘by’ part is used when we want to calculate the ‘j’ part grouped by a specific variable (or a manipulation of that variable). You will see that the ‘j’ expression is repeated for each ‘by’ group. It is simple to use: you just specify the column you want to group by in the ‘by’ argument.

```r
DT[, mean(na.omit(DepDelay)), by=Origin]
   Origin        V1
1:    IAH  8.436951
2:    HOU 12.837873
```

Here we calculated the average delay before departure, grouped by where the plane is coming from. It seems that flights departing from HOU have a larger average delay than those leaving from IAH.

Just as with the ‘j’ part, you can do a lot of *stuff* in the ‘by’ part. Functions can be used in the ‘by’ part so that results of the operations done in the ‘j’ part are grouped by something we specified in the DT[…] command. Using functions inside DT[…] makes that one line very powerful. Likewise, the ‘.()’ notation needs to be used when using several columns in the ‘by’ part.

Here, the average delay before departure of all planes (no subsetting in the ‘i’ part, so all rows are selected) was calculated, grouped first by the origin of the plane and then by weekday. The Weekdays variable is FALSE on weekends. It appears that the average delay before departure was larger when the plane left from HOU than from IAH, and, surprisingly, delays were smaller on weekends.

Putting it all together a typical DT[i,j,by] command gives:

```r
DT[UniqueCarrier=="DL",
   .(Avg_DepDelay = mean(na.omit(DepDelay)),
     Avg_ArrDelay = mean(na.omit(ArrDelay)),
     Compensation = mean(na.omit(ArrDelay - DepDelay))),
   by = .(Origin, Weekdays = DayOfWeek < 6)]
   Origin Weekdays Avg_DepDelay Avg_ArrDelay Compensation
1:    IAH    FALSE     8.979730     4.116751    -4.825719
2:    HOU    FALSE     7.120000     2.656566    -4.555556
3:    IAH     TRUE     9.270948     6.281941    -2.836609
4:    HOU     TRUE    11.631387    10.406593    -1.278388
```

Here the subset of planes flown by Delta Air Lines (selected in ‘i’) was grouped by their origin and by Weekdays (in ‘by’). The time that was compensated in the air was also calculated (in ‘j’). It appears that on weekends, irrespective of whether the plane was coming from IAH or HOU, the time compensated while in the air (thus by flying faster) is larger.

There is much more to discover in the data table package, but this post illustrated the basic DT[i,j,by] command. The DataCamp course explains the whole data table package extensively. You can do the exercises at your own pace in your browser while getting hints and feedback, and review the videos and slides as much as you want. This interactive way of learning allows you to gain profound knowledge and practical experience with data tables. Try it for free.

Hopefully, thanks to this data.table R tutorial, you now understand the fundamental syntax of data.table and are ready to experiment yourself. If you have questions concerning the data.table package, have a look here. Matt and Arun are very active. One of the next blog posts on the data.table package will be more technical, zooming in on the wide possibilities of data tables. Stay tuned!

The post A data.table R tutorial by DataCamp: intro to DT[i, j, by] appeared first on DataCamp Blog.


### R + ggplot2 Graph Catalog

(This article was first published on **Getting Genetics Done**, and kindly contributed to R-bloggers)

You can use the panel on the left to filter by plot type, graphical elements, or the chapter of the book if you’re actually using it. All of the code and data used for this website is open-source, in this GitHub repository. Here's an example for plotting population demographic data by county that uses faceting to create small multiples:

```r
library(ggplot2)
library(reshape2)
library(grid)

this_base = "fig08-15_population-data-by-county"

my_data = data.frame(
  Race = c("White", "Latino", "Black", "Asian American", "All Others"),
  Bronx = c(194000, 645000, 415000, 38000, 40000),
  Kings = c(855000, 488000, 845000, 184000, 93000),
  New.York = c(703000, 418000, 233000, 143000, 39000),
  Queens = c(733000, 556000, 420000, 392000, 128000),
  Richmond = c(317000, 54000, 40000, 24000, 9000),
  Nassau = c(986000, 133000, 129000, 62000, 24000),
  Suffolk = c(1118000, 149000, 92000, 34000, 26000),
  Westchester = c(592000, 145000, 123000, 41000, 23000),
  Rockland = c(205000, 29000, 30000, 16000, 6000),
  Bergen = c(638000, 91000, 43000, 94000, 18000),
  Hudson = c(215000, 242000, 73000, 57000, 22000),
  Passiac = c(252000, 147000, 60000, 18000, 12000))

my_data_long = melt(my_data, id = "Race",
                    variable.name = "county", value.name = "population")

my_data_long$county = factor(
  my_data_long$county, c("New.York", "Queens", "Kings", "Bronx", "Nassau",
                         "Suffolk", "Hudson", "Bergen", "Westchester",
                         "Rockland", "Richmond", "Passiac"))

my_data_long$Race =
  factor(my_data_long$Race,
         rev(c("White", "Latino", "Black", "Asian American", "All Others")))

p = ggplot(my_data_long, aes(x = population / 1000, y = Race)) +
  geom_point() +
  facet_wrap(~ county, ncol = 3) +
  scale_x_continuous(breaks = seq(0, 1000, 200),
                     labels = c(0, "", 400, "", 800, "")) +
  labs(x = "Population (thousands)", y = NULL) +
  ggtitle("Fig 8.15 Population Data by County") +
  theme_bw() +
  theme(panel.grid.major.y = element_line(colour = "grey60"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.margin = unit(0, "lines"),
        plot.title = element_text(size = rel(1.1), face = "bold", vjust = 2),
        strip.background = element_rect(fill = "grey80"),
        axis.ticks.y = element_blank())

p

ggsave(paste0(this_base, ".png"),
       p, width = 6, height = 8)
```

Keep in mind not all of these visualizations are recommended. You’ll find pie charts, ugly grouped bar charts, and other plots for which I can’t think of any sensible name. Just because you *can* use the add_cat() function from Hilary Parker’s cats package to fetch a random cat picture from the internet and create an annotation_raster layer to add to your ggplot2 plot, doesn’t necessarily mean you *should* do such a thing for a publication-quality figure. But if you ever needed to know how, this R graph catalog can help you out.

```r
library(ggplot2)

this_base = "0002_add-background-with-cats-package"

## devtools::install_github("hilaryparker/cats")
library(cats)
## library(help = "cats")

p = ggplot(mpg, aes(cty, hwy)) +
  add_cat() +
  geom_point()

p

ggsave(paste0(this_base, ".png"), p, width = 6, height = 5)
```

R graph catalog (via Laura Wiley)

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


### Canberra IAPA Seminar – Text Analytics: Natural Language into Big Data – 17 February

(This article was first published on **blog.RDataMining.com**, and kindly contributed to R-bloggers)

Topic: **Text Analytics: Natural Language into Big Data**

Speaker: Dr. Leif Hanlen, Technology Director at NICTA

Date: Tuesday 17 February

Time: 5.30pm for a 6pm start

Cost: Nil

Where: SAS Offices, 12 Moore Street, Canberra, ACT 2600

Registration URL: http://www.iapa.org.au/Event/TextAnalyticsNaturalLanguageIntoBigData

Abstract:

We outline several activities in NICTA relating to understanding and mining free text. Our approach is to develop agile service-focussed solutions that provide insight into large text corpora, and allow end users to incorporate current text documents into standard numerical analysis technologies.

Biography:

Dr. Leif Hanlen is Technology Director at NICTA, Australia’s largest ICT research centre. Leif is also an adjunct Associate Professor of ICT at the Australian National University and an adjunct Professor of Health at the University of Canberra. He received a BEng (Hons I) in electrical engineering, a BSc (Comp Sci) and a PhD (telecomm) from the University of Newcastle, Australia. His research focusses on applications of Machine Learning to text processing.

Please feel free to forward this invite to your friends and colleagues who might be interested. Thanks.

To **leave a comment** for the author, please follow the link and comment on his blog: ** blog.RDataMining.com**.

### BayesFactor version 0.9.10 released to CRAN

(This article was first published on ** BayesFactor: Software for Bayesian inference**, and kindly contributed to R-bloggers)

See below the fold for changes.

CHANGES IN BayesFactor VERSION 0.9.10

- Fixed bug in model enumeration code in generalTestBF (affected "withmain" analyses with neverExclude argument)
- Various bug fixes
- Analyses are more error tolerant (problem analyses will yield NA in BayesFactor object)
- Fixed some typos in citation information
- Improved numerical stability of proportional error estimates

To **leave a comment** for the author, please follow the link and comment on his blog: ** BayesFactor: Software for Bayesian inference**.

### R in Insurance 2015: Registration Opened

(This article was first published on ** mages' blog**, and kindly contributed to R-bloggers)

Registration for the R in Insurance conference on **29 June 2015** at the University of Amsterdam has opened.

This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:

- Prof. Richard Gill, Leiden University
- Dr James Guszcza, FCAS, Chief Data Scientist, Deloitte - US

Richard will talk about *Statistical Errors in Court*, and Jim's talk will be on *Predictive Modelling and Behaviour Insight* or *Actuarial Analytics in R*. We are thrilled that they accepted our invitation.

We invite you to submit a one-page abstract for consideration. Both academic and practitioner proposals related to R are encouraged. The submission deadline for abstracts is 28 March 2015.

Please email your abstract of no more than 300 words (in text or pdf format) to r-in-insurance@uva.nl.

Details about the registration and abstract submission are given on the dedicated R in Insurance page at the University of Amsterdam.

For more information about the past events visit www.rininsurance.com. This post was originally published at mages' blog.

To **leave a comment** for the author, please follow the link and comment on his blog: ** mages' blog**.

### Shiny for Interactive Application Development using R

(This article was first published on ** Adventures in Analytics and Visualization**, and kindly contributed to R-bloggers)

To **leave a comment** for the author, please follow the link and comment on his blog: ** Adventures in Analytics and Visualization**.

### Paris’s history, captured in its streets

(This article was first published on ** Revolutions**, and kindly contributed to R-bloggers)

The following image by Mathieu Rajerison has been doing the rounds of French media recently. It shows the streets of Paris, color-coded by their compass direction. It's been featured in an article in Telerama magazine, and even on French TV channel LCI (skip ahead to 8:20 in the linked video, which also features an interview with Mathieu).

Mathieu used the R language and OpenStreetMap data to construct the image, which colorizes each street according to the compass direction it points. Orthogonal streets are colored the same, so regular grids appear as swathes of uniform color. A planned city like Chicago would appear as a largely monochrome grid, but Paris exhibits much more variation. (You can see many other cities in this DataPointed.net article.) As this article in the French edition of Slate explains, the very history of Paris itself is encapsulated in the colored segments. You can easily spot Napoleon's planned boulevards as they cut through the older medieval neighborhoods, and agglomerated villages like Montmartre appear as rainbow-hued nuggets.

Mathieu explains the process of creating the chart in a blog post written in English. He uses the maptools package to import the OpenStreetMap shapefile and to extract the orientations of the streets. A simple R function is used to select colors for the streets, and then the entire map is sampled to a grid with the spatstat package, before finally being exported as a TIFF by the raster package. The entire chart is created with just 31 lines of R code, which you can find at the link below.
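The core coloring idea can be sketched in a few lines of base R. This is a simplified illustration, not Mathieu's actual code: the `street_hue` helper and the fold to a 90-degree period are assumptions made for the example.

```r
# Map a street's compass bearing (in degrees) to a colour.
# Orthogonal streets should share a colour, so fold the
# bearing into a 0-90 degree period before picking a hue.
street_hue <- function(bearing_deg) {
  folded <- bearing_deg %% 90          # 0 and 90 degrees coincide
  hsv(h = folded / 90, s = 1, v = 1)   # hue cycles once per period
}

# A north-south street and an east-west street get the same colour
street_hue(0) == street_hue(90)   # TRUE
# A diagonal street gets a different colour
street_hue(45)
```

Applying a function like this to every segment extracted from the shapefile, then rasterizing, gives the swathes of uniform color that regular grids produce.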

Data and GIS tips: Streets of Paris Colored by Orientation

To **leave a comment** for the author, please follow the link and comment on his blog: ** Revolutions**.

### Should you teach Python or R for data science?

(This article was first published on ** R - Data School**, and kindly contributed to R-bloggers)

Last week, I published a post titled Lessons learned from teaching an 11-week data science course, detailing my experiences and recommendations from teaching General Assembly's 66-hour introductory data science course.

In the comments, I received the following question:

I'm part of a team developing a course, with NSF support, in data science. The course will have no prerequisites and will be targeted for non-technical majors, with a goal to show how useful data science can be in their own area. Some of the modules we are developing include, for example, data cleansing, data mining, relational databases and NoSQL data stores. We are considering as tools the statistical environment R and Python and will likely develop two versions of this course. For now, we'd appreciate your sense of the relative merits of those two environments. We are hoping to get a sense of what would be more appropriate for computer and non computer science students, so if you have a sense of what colleagues that you know would prefer, that also would be helpful.

That's an excellent question! It doesn't have a simple answer (in my opinion) because both languages are great for data science, but one might be better than the other depending upon your students and your priorities.

At General Assembly in DC, we currently teach the course entirely in Python, though we used to teach it in both R and Python. I also mentor data science students in R, and I'm a teaching assistant for online courses in both R and Python. I enjoy using both languages, though I have a slight personal preference for Python specifically because of its machine learning capabilities (more details below).

Here are some questions that might help you (as educators or curriculum developers) to assess which language is a better fit for your students:

**Do your students have experience programming in other languages?** If your students have some programming experience, Python may be the better choice because its syntax is more similar to other languages, whereas R's syntax is thought to be unintuitive by many programmers. If your students don't have any programming experience, I think both languages have an equivalent learning curve, though many people would argue that Python is easier to learn because its code reads more like regular human language.

**Do your students want to go into academia or industry?** In academia, especially in the field of statistics, R is much more widely used than Python. In industry, the data science trend is slowly moving from R towards Python. One contributing factor is that companies using a Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in "productionizing" a data scientist's work.

**Are you teaching "machine learning" or "statistical learning"?** The line between these two terms is blurry, but machine learning is concerned primarily with predictive accuracy over model interpretability, whereas statistical learning places a greater priority on interpretability and statistical inference. To some extent, R "assumes" that you are performing statistical learning and makes it easy to assess and diagnose your models. scikit-learn, by far the most popular machine learning package for Python, is more concerned with predictive accuracy. (For example, scikit-learn makes it very easy to tune and cross-validate your models and switch between different models, but makes it much harder than R to actually "examine" your models.) Thus, R is probably the better choice if you are teaching statistical learning, though Python also has a nice package for statistical modeling (Statsmodels) that duplicates some of R's functionality.

**Do you care more about the ease with which students can get started in machine learning, or the ease with which they can go deeper into machine learning?** In R, getting started with your first model is easy: read your data into a data frame, use a built-in model (such as linear regression) along with R's easy-to-read formula language, and then review the model's summary output. In Python, it can be much more of a challenging process to get started simply because there are so many choices to make: How should I read in my data? Which data structure should I store it in? Which machine learning package should I use? What type of objects does that package allow as input? What shape should those objects be in? How do I include categorical variables? How do I access the model's output? (Et cetera.) Because Python is a general purpose programming language whereas R specializes in a smaller subset of statistically-oriented tasks, those tasks tend to be easier to do (at least initially) in R.

However, once you have mastered the basics of machine learning in Python (using scikit-learn), I find that machine learning is actually a lot easier in Python than in R. scikit-learn provides a clean and consistent interface to tons of different models. It provides you with many options for each model, but also chooses sensible defaults. Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly. It is also actively being developed.

In R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be under active development. (caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, but it's nowhere near as elegant a solution as scikit-learn.) In summary, machine learning in R tends to be a more tiresome experience than machine learning in Python once you have moved beyond the basics. As such, Python may be a better choice if students are planning to go deeper into machine learning.

**Do your students care about learning a "sexy" language?** R is not a sexy language. It feels old, and its website looks like it was created around the time the web was invented. Python is the "new kid" on the data science block, and has far more sex appeal. From a marketing perspective, Python may be the better choice simply because it will attract more students.

**How computer savvy are your students?** Installing R is a simple process, and installing RStudio (the de facto IDE for R) is just as easy. Installing new packages or upgrading existing packages from CRAN (R's package management system) is a trivial process within RStudio, and even installing packages hosted on GitHub is a simple process thanks to the devtools package.

By comparison, Python itself may be easy to install, but installing individual Python packages can be much more challenging. In my classroom, we encourage students to use the Anaconda distribution of Python, which includes nearly every Python package we use in the course and has a package management system similar to CRAN. However, Anaconda installation and configuration problems are still common in my classroom, whereas these problems were much more rare when using R and RStudio. As such, R may be the better choice if your students are not computer savvy.

**Is data cleaning a focus of your course?** Data cleaning (also known as "data munging") is the process of transforming your raw data into a more meaningful form. I find data cleaning to be easier in Python because of its rich set of data structures, as well as its far superior implementation of regular expressions (which are often necessary for cleaning text).

**Is data exploration a focus of your course?** The pandas package in Python is an extremely powerful tool for data exploration, though its power and flexibility can also make it challenging to learn. R's dplyr is more limited in its capabilities than pandas (by design), though I find that its more focused approach makes it easier to figure out how to accomplish a given task. As well, dplyr's syntax is more readable and thus is easier for me to remember. Although it's not a clear differentiator, I would consider R a slightly easier environment for getting started in data exploration due to the ease of learning dplyr.

**Is data visualization a focus of your course?** R's ggplot2 is an excellent package for data visualization. Once you understand its core principles (its "grammar of graphics"), it feels like the most natural way to build your plots, and it becomes easy to produce sophisticated and attractive plots. Matplotlib is the de facto standard for scientific plotting in Python, but I find it tedious both to learn and to use. Alternatives like Seaborn and pandas plotting still require you to know some Matplotlib, and the alternative that I find most promising (ggplot for Python) is still early in development. Therefore, I consider R the better choice for data visualization.

**Is Natural Language Processing (NLP) part of your curriculum?** Python's Natural Language Toolkit (NLTK) is a mature, well-documented package for NLP. TextBlob is a simpler alternative, spaCy is a brand new alternative focused on performance, and scikit-learn also provides some supporting functionality for text-based feature extraction. In comparison, I find R's primary NLP framework (the tm package) to be significantly more limited and harder to use. Even if there are additional R packages that can fill in the gaps, there isn't one comprehensive package that you can use to get started, and thus Python is the better choice for teaching NLP.

If you are a data science educator, or even just a data scientist who uses R or Python, **I'd love to hear from you in the comments!** On which points above do you agree or disagree? What are some important factors that I have left out? What language do you teach in the classroom, and why?

I look forward to this conversation!

P.S. Want to hear about new Data School posts or video tutorials? Subscribe to my newsletter.

P.P.S. Want to Tweet about this post? Here's a Tweet you can RT.

To **leave a comment** for the author, please follow the link and comment on his blog: ** R - Data School**.

### My "Top 5 R Functions"

(This article was first published on ** Minding the Brain**, and kindly contributed to R-bloggers)

- subset() for making subsets of data (natch)
- merge() for combining data sets in a smart and easy way
- melt() for converting from wide to long data formats
- dcast() for converting from long to wide data formats, and for making summary tables
- ddply() for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations
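The base-R entries can be illustrated in a few lines (melt(), dcast() and ddply() come from the reshape2 and plyr packages; the toy data frames below are made up for the example):

```r
scores <- data.frame(id = c(1, 2, 3), score = c(85, 90, 78))
info   <- data.frame(id = c(1, 2, 3), group = c("A", "B", "A"))

# subset(): keep only the rows you care about
high <- subset(scores, score >= 80)

# merge(): combine data sets by a shared key column
combined <- merge(high, info, by = "id")
combined
#   id score group
# 1  1    85     A
# 2  2    90     B
```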

Conspicuously missing from the above list is ggplot, which I think deserves a special lifetime achievement award for how it has transformed how I think about data exploration and data visualization. I'm planning that for the next R Workgroup meeting.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Minding the Brain**.

### Embedding R-generated Interactive HTML pages in MS PowerPoint

(This article was first published on ** Mango Solutions**, and kindly contributed to R-bloggers)

By Richard Pugh – Commercial Director, UK.

Usually when I create slide decks these days I use markdown and slidy. However, I was recently asked to present using an existing Revolution Microsoft PowerPoint template.

Trouble is, I’ve been spoilt with the advantages of using an HTML-based presentation technology and I wanted to include some interactive web elements. In particular, I wanted to use a motion chart generated with the fantastic googleVis package. Of course, that presented an issue – how was I to include some interactive HTML elements in my PowerPoint deck?

The answer turned out to involve a PowerPoint plug-in called LiveWeb. These were the steps I took:

- Download LiveWeb from http://skp.mvps.org/liveweb.htm and install it. This adds a couple of buttons onto the PowerPoint tool bar (for me, it appears in the “Insert” part of the ribbon)

- Generate your web content. In my version, that meant using googleVis to generate a web page
- Use the LiveWeb plugin to point your slide to the web page
- Click play and wave your hands like Hans Rosling while rapidly talking about your slide:

Btw – this also works for other HTML content, such as Shiny apps. Here’s one from the RStudio Shiny Example page …

So, if you want to use MS PowerPoint, it is still possible to include R-generated interactive HTML content using the above steps.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Mango Solutions**.

### RcppStreams 0.1.0

(This article was first published on ** Thinking inside the box **, and kindly contributed to R-bloggers)

The new package RcppStreams arrived on CRAN on Saturday. RcppStreams brings the excellent Streamulus C++ template library for event stream processing to R.

Streamulus, written by Irit Katriel, uses very clever template meta-programming (via Boost Fusion) to implement an embedded *domain-specific event language* created specifically for event stream processing.

The package wraps the four existing examples by Irit and all her unit tests, and includes a slide deck from a workshop presentation. The framework is quite powerful, and it brings a very promising avenue for (efficient!) stream event processing to R.

The NEWS file entries follow below:

Changes in version 0.1.0 (2015-01-30)

First CRAN release

Contains all upstream examples and documentation from Streamulus

Added Rcpp Attributes wrapping and minimal manual pages

Added Travis CI unit testing

Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this initial release. More detailed information is on the RcppStreams page and of course on the Streamulus page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To **leave a comment** for the author, please follow the link and comment on his blog: ** Thinking inside the box **.