R bloggers

R news and tutorials contributed by (573) R bloggers

Sharing Your Shiny Apps

Tue, 2015-02-03 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Siddarth Ramesh
R Programmer, Revolution Analytics

A couple of months ago, I worked on a customer engagement involving Shiny. Shiny is a package created by RStudio that makes plots dynamic and interactive. One advantage of Shiny is that an app can be written entirely in R, without needing to know any other programming language. Shiny is a powerful tool for creating an interactive interface for applications such as predictive analysis. In this post I’ll give a high-level view of Shiny and delve into one of its most valuable features – the ease of sharing your app with the world.

For those of you who are not familiar with Shiny, I’ll briefly describe its high-level architecture. A Shiny application is composed of two R files – a server file and a user interface (UI) file. The UI file acts as an HTML interpreter – you can create your buttons, checkboxes, images, and other HTML widgets here. The server file is where Shiny’s real magic happens. This is where you can make those buttons and checkboxes you created in your UI actually do something. In other words, it is where R users can turn their app, using only R, into a dynamic visual masterpiece. If you run runApp("appName") from your console window – with the Shiny app folder as your working directory, of course – you can see your output.

The following is a small Shiny app I made. Note that the ui.R and server.R files work in conjunction and must be in the same directory to create the final output:

UI Code
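The original post embeds the code as an image, which is not reproduced here. As a stand-in, here is a minimal, illustrative ui.R of the kind described – a slider widget plus a plot placeholder – and not the author’s actual app:

# ui.R -- illustrative sketch only
library(shiny)

shinyUI(fluidPage(
  titlePanel("Histogram of a random sample"),
  sidebarLayout(
    sidebarPanel(
      # the slider's value is available to the server as input$bins
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(
      # placeholder filled by the server's output$distPlot
      plotOutput("distPlot")
    )
  )
))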

 

Server Code
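Again, the original server.R appears as an image; the sketch below simply matches the illustrative ui.R above and is not the author’s code:

# server.R -- illustrative sketch matching the ui.R above
library(shiny)

shinyServer(function(input, output) {
  output$distPlot <- renderPlot({
    x <- rnorm(500)
    # the histogram re-renders whenever the slider (input$bins) changes
    hist(x, breaks = input$bins, col = "steelblue", border = "white",
         main = "500 draws from N(0, 1)", xlab = "x")
  })
})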

 

With Shiny, you can completely customize your specifications for a plot and the plot will change right in front of you.

One of Shiny’s benefits lies in the ease with which Shiny apps can be shared. Originally, I thought that in order to host my Shiny app on the web, I would need to somehow procure a dedicated server and acquire a URL before I could even think about sharing it with the world. That is a complicated process – I had a server and was still figuring out how to get the URL when I discovered shinyapps.io.

Shinyapps.io is a platform that allows you to share your Shiny applications online. To use it, you first have to install the shinyapps package with:

devtools::install_github('rstudio/shinyapps')

Depending on the type of operating system you have, Shinyapps.io has a few dependencies that need to be installed. Since I am a Windows user, I needed RTools and the devtools R package. Linux users need GCC, and Mac users need XCode Command Line Tools.

Instead of running runApp("ShinyAppName"), which opens your Shiny app locally on your machine, you run deployApp("ShinyAppName"). The deployApp() command automatically syncs with the shinyapps.io server and opens the shinyapps.io website in your browser. If you have never used shinyapps.io, you must first set up an account. Creating an account is simple, and once you have one, your application from RStudio becomes an instance on the shinyapps.io server with its own URL. At this point, your application is on the internet and can be viewed by anybody with access to it.
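As a rough sketch, the full deployment workflow looks something like the following; the account name, token, and secret are placeholders you copy from your shinyapps.io dashboard, and the app folder name is hypothetical:

library(shinyapps)

# one-time account setup; token and secret come from the shinyapps.io dashboard
shinyapps::setAccountInfo(name   = "your-account",
                          token  = "YOUR_TOKEN",
                          secret = "YOUR_SECRET")

# upload the app folder and open its new URL in the browser
deployApp("ShinyAppName")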

There are some advantages to using shinyapps.io. One is that it removes the need for your own server, since the app is hosted on shinyapps.io’s virtualized servers. This saves you some money, and you do not have to worry about maintaining the server. If you are worried about security, you do not have to be: each Shiny app runs in a protected environment, each application is SSL encrypted, and user authentication is offered. R users who work with R Markdown may notice that uploading a Shiny app to shinyapps.io is fairly similar to uploading a Markdown file to RPubs.

The following link points to the output of the simple Shiny application I created above, hosted as an instance on shinyapps.io:

 

Shiny is constantly being improved upon, and as aesthetically pleasing and smooth as it is right now, it is only getting more elegant as time goes by. If you are interested in exploring the power and diversity of Shiny, check out the link below!

http://www.showmeshiny.com/

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

A data.table R tutorial by DataCamp: intro to DT[i, j, by]

Tue, 2015-02-03 10:07

(This article was first published on DataCamp Blog » R, and kindly contributed to R-bloggers)

This data.table R tutorial explains the basics of the DT[i, j, by] command, which is core to the data.table package. If you want to learn more about the data.table package, DataCamp provides an interactive R course on it. The course has more than 35 interactive R exercises – all taking place in the comfort of your own browser – and several videos with Matt Dowle, main author of the data.table package, and Arun Srinivasan, a major contributor. Try it for free.

If you have already worked with large datasets in RAM (1 to more than 100 GB), you know that a data.frame can be limiting: the time it takes to do certain things is just too long. data.table solves this for you by reducing computing time. Even more, it also makes it easier to do more with less typing. Once you master the data.table syntax from this data.table R tutorial, the simplicity of doing complicated operations will astonish you. So you will not only be reducing computing time, but programming time as well.

The DT[i,j,by] command has three parts: i, j and by. If you think in SQL terminology, the i corresponds to WHERE, j to SELECT and by to GROUP BY. We talk about the command by saying “Take DT, subset the rows using ‘i’, then calculate ‘j’ grouped by ‘by’”. So in a simple example and using the hflights dataset (so you can reproduce all the examples) this gives:

library(hflights)
library(data.table)
DT <- as.data.table(hflights)
DT[Month==10, mean(na.omit(AirTime)), by=UniqueCarrier]
    UniqueCarrier        V1
 1:            AA  68.76471
 2:            AS 255.29032
 3:            B6 176.93548
 4:            CO 141.52861
...

Here we subsetted the data table to keep only the rows of the 10th month of the year, calculated the average AirTime of the planes that actually flew (that’s why na.omit() is used – cancelled flights don’t have a value for their AirTime), and then grouped the results by carrier. We can see, for example, that AA (American Airlines) has a very short average AirTime compared to AS (Alaska Airlines). Did you also notice that base R functions can be used in the j part? We will get to that later.

The i part

The ‘i’ part is used for subsetting on rows, just like in a data frame.

DT[2:5] #selects the second to the fifth row of DT
   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime
1: 2011     1          2         7    1401    1501            AA       428  N557AA                60      45
2: 2011     1          3         1    1352    1502            AA       428  N541AA                70      48
3: 2011     1          4         2    1403    1513            AA       428  N403AA                70      39
4: 2011     1          5         3    1405    1507            AA       428  N492AA                62      44
   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
1:       -9        1    IAH  DFW      224      6       9         0                         0
2:       -8       -8    IAH  DFW      224      5      17         0                         0
3:        3        3    IAH  DFW      224      9      22         0                         0
4:       -3        5    IAH  DFW      224      9       9         0                         0

But you can also use column names, as they are evaluated in the scope of DT.

DT[UniqueCarrier=="AA"] #Returns all those rows where the Carrier is American Airlines
      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime
   1: 2011     1          1         6    1400    1500            AA       428  N576AA                60
   2: 2011     1          2         7    1401    1501            AA       428  N557AA                60
   3: 2011     1          3         1    1352    1502            AA       428  N541AA                70
   4: 2011     1          4         2    1403    1513            AA       428  N403AA                70
   5: 2011     1          5         3    1405    1507            AA       428  N492AA                62
  ---
3240: 2011    12         27         2    1021    1333            AA      2234  N3ETAA               132
3241: 2011    12         28         3    1015    1329            AA      2234  N3FJAA               134
3242: 2011    12         29         4    1023    1335            AA      2234  N3GSAA               132
3243: 2011    12         30         5    1024    1334            AA      2234  N3BAAA               130
3244: 2011    12         31         6    1024    1343            AA      2234  N3HNAA               139
      AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
   1:      40      -10        0    IAH  DFW      224      7      13         0                         0
   2:      45       -9        1    IAH  DFW      224      6       9         0                         0
   3:      48       -8       -8    IAH  DFW      224      5      17         0                         0
   4:      39        3        3    IAH  DFW      224      9      22         0                         0
   5:      44       -3        5    IAH  DFW      224      9       9         0                         0
  ---
3240:     112      -12        1    IAH  MIA      964      8      12         0                         0
3241:     112      -16       -5    IAH  MIA      964      9      13         0                         0
3242:     110      -10        3    IAH  MIA      964     12      10         0                         0
3243:     110      -11        4    IAH  MIA      964      9      11         0                         0
3244:     119       -2        4    IAH  MIA      964      8      12         0                         0

Notice that you don’t have to use a comma to subset rows in a data table. For a data.frame, DF[2:5] would return the 2nd to 5th columns (all rows); to select rows instead (as everyone reading this obviously knows), we have to specify DF[2:5,]. Also notice that DT[,2:5] does not mean anything for data tables, as is explained in the first question of the FAQs of the data.table package.
Quirky and useful: when subsetting rows you can also use the symbol .N in the DT[…] command, which is the number of rows or the last row. You can use it for selecting the last row or an offset from it.

DT[.N-1] #Returns the penultimate row of DT
   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime
1: 2011    12          6         2     656     812            WN       621  N727SW                76      64
   ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
1:      -13       -4    HOU  TUL      453      3       9         0                         0

The j part

The ‘j’ part is used to select columns and do stuff with them. And stuff can really mean anything. All kinds of functions can be used, which is a strong point of the data.table package.

DT[, mean(na.omit(ArrDelay))]
[1] 7.094334

Notice that the ‘i’ part is left blank, and the first thing in the brackets is a comma. This might seem counterintuitive at first. However, it simply means that we do not subset on any rows, so all rows are selected. In the ‘j’ part, the average arrival delay of all flights is calculated. It appears that the average flight in the hflights dataset had more than 7 minutes of delay. Be prepared when catching your next flight!

When selecting several columns and doing stuff with them in the ‘j’ part, you need to use the ‘.()’ notation. This notation is actually just an alias to ‘list()’. It returns a data table, whereas not using ‘.()’ only returns a vector, as shown above.

DT[, .(mean(na.omit(DepDelay)), mean(na.omit(ArrDelay)))]
         V1       V2
1: 9.444951 7.094334

Another useful feature which requires the ‘.()’ notation allows you to rename columns inside the DT[…] command.

DT[, .(Avg_ArrDelay = mean(na.omit(ArrDelay)))]
   Avg_ArrDelay
1:     7.094334

DT[, .(Avg_DepDelay = mean(na.omit(DepDelay)), Avg_ArrDelay = mean(na.omit(ArrDelay)))]
   Avg_DepDelay Avg_ArrDelay
1:     9.444951     7.094334

Of course, new column names are not obligatory.

Combining the above about ‘i’ and ‘j’ gives:

DT[UniqueCarrier=="AA", .(Avg_DepDelay = mean(na.omit(DepDelay)),
                          Avg_ArrDelay = mean(na.omit(ArrDelay)),
                          plot(DepTime, DepDelay, ylim=c(-15,200)),
                          abline(h=0))]
   Avg_DepDelay Avg_ArrDelay   V3   V4
1:     6.390144    0.8917558 NULL NULL


Here we took DT, selected all rows where the carrier was AA in the ‘i’ part, calculated the average delay on departure and on arrival, and plotted the time of departure against the delay on departure in the ‘j’ part.

To recap, the ‘j’ part is used to do calculations on columns specified in that part. As the columns of a data table are seen as variables, and the parts of ‘j’ are evaluated as expressions, virtually anything can be done in the ‘j’ part. This significantly shortens your programming time.

The by part

The final section of this data.table R tutorial focuses on the ‘by’ part. The ‘by’ part is used when we want to calculate the ‘j’ part grouped by a specific variable (or a manipulation of that variable). You will see that the ‘j’ expression is repeated for each ‘by’ group. It is simple to use: you just specify the column you want to group by in the ‘by’ argument.

DT[, mean(na.omit(DepDelay)), by=Origin]
   Origin        V1
1:    IAH  8.436951
2:    HOU 12.837873

Here we calculated the average delay before departure, but grouped by where the plane is coming from.  It seems that flights departing from HOU have a larger average delay than those leaving from IAH.

Just as with the ‘j’ part, you can do a lot of stuff in the ‘by’ part. Functions can be used in the ‘by’ part so that results of the operations done in the ‘j’ part are grouped by something we specified in the DT[…] command. Using functions inside DT[…] makes that one line very powerful. Likewise, the ‘.()’ notation needs to be used when using several columns in the ‘by’ part.

DT[, .(Avg_DepDelay_byWeekdays = mean(na.omit(DepDelay))),
   by=.(Origin, Weekdays = DayOfWeek<6)]
   Origin Weekdays Avg_DepDelay_byWeekdays
1:    IAH    FALSE                8.286543
2:    IAH     TRUE                8.492484
3:    HOU    FALSE               10.965384
4:    HOU     TRUE               13.433994

Here, the average departure delay of all planes (no subsetting in the ‘i’ part, so all rows are selected) was calculated and grouped, first by origin of the plane and then by weekday. Weekdays is FALSE on weekends. It appears that the average departure delay was larger when the plane left from HOU than from IAH, and, surprisingly, the delays were smaller on weekends.

Putting it all together a typical DT[i,j,by] command gives:

DT[UniqueCarrier=="DL", .(Avg_DepDelay = mean(na.omit(DepDelay)),
                          Avg_ArrDelay = mean(na.omit(ArrDelay)),
                          Compensation = mean(na.omit(ArrDelay - DepDelay))),
   by = .(Origin, Weekdays = DayOfWeek<6)]
   Origin Weekdays Avg_DepDelay Avg_ArrDelay Compensation
1:    IAH    FALSE     8.979730     4.116751    -4.825719
2:    HOU    FALSE     7.120000     2.656566    -4.555556
3:    IAH     TRUE     9.270948     6.281941    -2.836609
4:    HOU     TRUE    11.631387    10.406593    -1.278388

Here the subset of flights flown by Delta Air Lines (selected in ‘i’) was grouped by origin and by Weekdays (in ‘by’). The time that was compensated in the air was also calculated (in ‘j’). It appears that on weekends, irrespective of whether the plane was coming from IAH or HOU, the time compensated while in the air (thus by flying faster) is larger.

There is much more to discover in the data table package, but this post illustrated the basic DT[i,j,by] command. The DataCamp course explains the whole data table package extensively. You can do the exercises at your own pace in your browser while getting hints and feedback, and review the videos and slides as much as you want. This interactive way of learning allows you to gain profound knowledge and practical experience with data tables.  Try it  for free.

Hopefully, thanks to this data.table R tutorial, you now understand the fundamental syntax of data.table and are ready to experiment yourself. If you have questions concerning the data.table package, have a look here. Matt and Arun are very active. One of the next blog posts on the data.table package will be more technical, zooming in on the wide possibilities of data tables. Stay tuned!

The post A data.table R tutorial by DataCamp: intro to DT[i, j, by] appeared first on DataCamp Blog.

To leave a comment for the author, please follow the link and comment on his blog: DataCamp Blog » R.

Categories: Methodology Blogs

R + ggplot2 Graph Catalog

Tue, 2015-02-03 08:33

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

Joanna Zhao’s and Jenny Bryan’s R graph catalog is meant to be a complement to the physical book, Creating More Effective Graphs, but it’s a really nice gallery in its own right. The catalog shows a series of different data visualizations, all made with R and ggplot2. Click on any of the plots and you get the R code necessary to generate the data and produce the plot.
You can use the panel on the left to filter by plot type, graphical elements, or the chapter of the book if you’re actually using it. All of the code and data used for this website is open-source, in this GitHub repository. Here's an example for plotting population demographic data by county that uses faceting to create small multiples:

library(ggplot2)
library(reshape2)
library(grid)

this_base = "fig08-15_population-data-by-county"

my_data = data.frame(
Race = c("White", "Latino", "Black", "Asian American", "All Others"),
Bronx = c(194000, 645000, 415000, 38000, 40000),
Kings = c(855000, 488000, 845000, 184000, 93000),
New.York = c(703000, 418000, 233000, 143000, 39000),
Queens = c(733000, 556000, 420000, 392000, 128000),
Richmond = c(317000, 54000, 40000, 24000, 9000),
Nassau = c(986000, 133000, 129000, 62000, 24000),
Suffolk = c(1118000, 149000, 92000, 34000, 26000),
Westchester = c(592000, 145000, 123000, 41000, 23000),
Rockland = c(205000, 29000, 30000, 16000, 6000),
Bergen = c(638000, 91000, 43000, 94000, 18000),
Hudson = c(215000, 242000, 73000, 57000, 22000),
Passiac = c(252000, 147000, 60000, 18000, 12000))

my_data_long = melt(my_data, id = "Race",
variable.name = "county", value.name = "population")

my_data_long$county = factor(
my_data_long$county, c("New.York", "Queens", "Kings", "Bronx", "Nassau",
"Suffolk", "Hudson", "Bergen", "Westchester",
"Rockland", "Richmond", "Passiac"))

my_data_long$Race =
factor(my_data_long$Race,
rev(c("White", "Latino", "Black", "Asian American", "All Others")))

p = ggplot(my_data_long, aes(x = population / 1000, y = Race)) +
geom_point() +
facet_wrap(~ county, ncol = 3) +
scale_x_continuous(breaks = seq(0, 1000, 200),
labels = c(0, "", 400, "", 800, "")) +
labs(x = "Population (thousands)", y = NULL) +
ggtitle("Fig 8.15 Population Data by County") +
theme_bw() +
theme(panel.grid.major.y = element_line(colour = "grey60"),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
panel.margin = unit(0, "lines"),
plot.title = element_text(size = rel(1.1), face = "bold", vjust = 2),
strip.background = element_rect(fill = "grey80"),
axis.ticks.y = element_blank())

p

ggsave(paste0(this_base, ".png"),
p, width = 6, height = 8)

Keep in mind not all of these visualizations are recommended. You’ll find pie charts, ugly grouped bar charts, and other plots for which I can’t think of any sensible name. Just because you can use the add_cat() function from Hilary Parker’s cats package to fetch a random cat picture from the internet and create an annotation_raster layer to add to your ggplot2 plot, doesn’t necessarily mean you should do such a thing for a publication-quality figure. But if you ever needed to know how, this R graph catalog can help you out.

library(ggplot2)

this_base = "0002_add-background-with-cats-package"

## devtools::install_github("hilaryparker/cats")
library(cats)
## library(help = "cats")

p = ggplot(mpg, aes(cty, hwy)) +
add_cat() +
geom_point()
p

ggsave(paste0(this_base, ".png"), p, width = 6, height = 5)

R graph catalog (via Laura Wiley)

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.

Categories: Methodology Blogs

Canberra IAPA Seminar – Text Analytics: Natural Language into Big Data – 17 February

Tue, 2015-02-03 06:44

(This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers)

Topic: Text Analytics: Natural Language into Big Data
Speaker: Dr. Leif Hanlen, Technology Director at NICTA
Date: Tuesday 17 February
Time: 5.30pm for a 6pm start
Cost: Nil
Where: SAS Offices, 12 Moore Street, Canberra, ACT 2600
Registration URL: http://www.iapa.org.au/Event/TextAnalyticsNaturalLanguageIntoBigData

Abstract:
We outline several activities in NICTA relating to understanding and mining free text. Our approach is to develop agile service-focussed solutions that provide insight into large text corpora, and allow end users to incorporate current text documents into standard numerical analysis technologies.

Biography:
Dr. Leif Hanlen is Technology Director at NICTA, Australia’s largest ICT research centre. Leif is also an adjunct Associate Professor of ICT at the Australian National University and an adjunct Professor of Health at the University of Canberra. He received a BEng (Hons I) in electrical engineering, BSc (Comp Sci) and PhD (telecomm) from the University of Newcastle Australia. His research focusses on applications of machine learning to text processing.

Please feel free to forward this invite to your friends and colleagues who might be interested. Thanks.


To leave a comment for the author, please follow the link and comment on his blog: blog.RDataMining.com.

Categories: Methodology Blogs

BayesFactor version 0.9.10 released to CRAN

Tue, 2015-02-03 03:04

(This article was first published on BayesFactor: Software for Bayesian inference, and kindly contributed to R-bloggers)

If you're running Solaris (yes, all zero of you) you'll have to wait for version 0.9.10-1, due to a small issue preventing the Solaris package from building on CRAN. The rest of you can pick up the updated version on CRAN today.

See below the fold for changes.

CHANGES IN BayesFactor VERSION 0.9.10

CHANGES
  • Fixed bug in model enumeration code in generalTestBF (affected "withmain" analyses with neverExclude argument)
  • Various bug fixes
  • Analyses are more error tolerant (problem analyses will yield NA in BayesFactor object)
  • Fixed some typos in citation information
  • Improved numerical stability of proportional error estimates

To leave a comment for the author, please follow the link and comment on his blog: BayesFactor: Software for Bayesian inference.

Categories: Methodology Blogs

R in Insurance 2015: Registration Opened

Tue, 2015-02-03 02:22

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

The registration for the third conference on R in Insurance on Monday 29 June 2015 at the University of Amsterdam has opened.


This one-day conference will focus again on applications in insurance and actuarial science that use R, the lingua franca for statistical computation.

The intended audience of the conference includes both academics and practitioners who are active or interested in the applications of R in insurance.

Invited talks will be given by:
  • Prof. Richard Gill, Leiden University
  • Dr James Guszcza, FCAS, Chief Data Scientist, Deloitte - US
Perhaps you have seen Richard's TED talk on Statistical Errors in Court or Jim's talk on Predictive Modelling and Behaviour Insight or Actuarial Analytics in R. We are thrilled that they accepted our invitation.

We invite you to submit a one-page abstract for consideration. Both academic and practitioner proposals related to R are encouraged. The submission deadline for abstracts is 28 March 2015.
Please email your abstract of no more than 300 words (in text or pdf format) to r-in-insurance@uva.nl.

Details about the registration and abstract submission are given on the dedicated R in Insurance page at the University of Amsterdam.

For more information about the past events visit www.rininsurance.com. This post was originally published at mages' blog.

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

Categories: Methodology Blogs

Shiny for Interactive Application Development using R

Mon, 2015-02-02 17:48

(This article was first published on Adventures in Analytics and Visualization, and kindly contributed to R-bloggers)

This slidify-based deck introduces the shiny package from R-Studio and walks one through the development of an interactive application that presents users with options to subset the iris dataset, generate a summary of the resulting dataset, and determine variables for a scatter plot and a box plot. Please click on this link to visit the blog post: http://patilv.com/shinyapp/

To leave a comment for the author, please follow the link and comment on his blog: Adventures in Analytics and Visualization.

Categories: Methodology Blogs

Paris’s history, captured in its streets

Mon, 2015-02-02 16:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The following image by Mathieu Rajerison has been doing the rounds of French media recently. It shows the streets of Paris, color-coded by their compass direction. It's been featured in an article in Telerama magazine, and even on French TV channel LCI (skip ahead to 8:20 in the linked video, which also features an interview with Mathieu).

Mathieu used the R language and OpenStreetMap data to construct the image, which colorizes each street according to the compass direction it points. Orthogonal streets are colored the same, so regular grids appear as swathes of uniform color. A planned city like Chicago would appear as a largely monochrome grid, but Paris exhibits much more variation. (You can see many other cities in this DataPointed.net article.) As this article in the French edition of Slate explains, the very history of Paris itself is encapsulated in the colored segments. You can easily spot Napoleon's planned boulevards as they cut through the older medieval neighborhoods, and agglomerated villages like Montmartre appear as rainbow-hued nuggets.

Mathieu explains the process of creating the chart in a blog post written in English. He uses the maptools package to import the OpenStreetMap shapefile and to extract the orientations of the streets. A simple R function is used to select colors for the streets, and then the entire map is sampled to a grid with the spatstat package, before finally being exported as a TIFF by the raster package. The entire chart is created with just 31 lines of R code, which you can find at the link below.
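Mathieu’s actual script uses the packages listed above; purely to illustrate the core idea – compute each segment’s bearing and give orthogonal directions the same colour – here is a small base-R sketch on simulated segments:

# toy street segments: endpoints (x1, y1) -> (x2, y2); data are made up
set.seed(1)
segs <- data.frame(x1 = runif(200), y1 = runif(200),
                   x2 = runif(200), y2 = runif(200))

# bearing of each segment in degrees, folded at 90 degrees so that
# perpendicular (orthogonal) streets receive the same hue
ang <- atan2(segs$y2 - segs$y1, segs$x2 - segs$x1) * 180 / pi
hue <- (ang %% 90) / 90

plot(NA, xlim = c(0, 1), ylim = c(0, 1), asp = 1,
     xlab = "", ylab = "", axes = FALSE)
segments(segs$x1, segs$y1, segs$x2, segs$y2,
         col = hsv(h = hue, s = 1, v = 0.9), lwd = 2)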

Data and GIS tips: Streets of Paris Colored by Orientation

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

Categories: Methodology Blogs

Should you teach Python or R for data science?

Mon, 2015-02-02 13:34

(This article was first published on R - Data School, and kindly contributed to R-bloggers)

Last week, I published a post titled Lessons learned from teaching an 11-week data science course, detailing my experiences and recommendations from teaching General Assembly's 66-hour introductory data science course.

In the comments, I received the following question:

I'm part of a team developing a course, with NSF support, in data science. The course will have no prerequisites and will be targeted for non-technical majors, with a goal to show how useful data science can be in their own area. Some of the modules we are developing include, for example, data cleansing, data mining, relational databases and NoSQL data stores. We are considering as tools the statistical environment R and Python and will likely develop two versions of this course. For now, we'd appreciate your sense of the relative merits of those two environments. We are hoping to get a sense of what would be more appropriate for computer and non computer science students, so if you have a sense of what colleagues that you know would prefer, that also would be helpful.

That's an excellent question! It doesn't have a simple answer (in my opinion) because both languages are great for data science, but one might be better than the other depending upon your students and your priorities.

At General Assembly in DC, we currently teach the course entirely in Python, though we used to teach it in both R and Python. I also mentor data science students in R, and I'm a teaching assistant for online courses in both R and Python. I enjoy using both languages, though I have a slight personal preference for Python specifically because of its machine learning capabilities (more details below).

Here are some questions that might help you (as educators or curriculum developers) to assess which language is a better fit for your students:

Do your students have experience programming in other languages?

If your students have some programming experience, Python may be the better choice because its syntax is more similar to other languages, whereas R's syntax is thought to be unintuitive by many programmers. If your students don't have any programming experience, I think both languages have an equivalent learning curve, though many people would argue that Python is easier to learn because its code reads more like regular human language.

Do your students want to go into academia or industry?

In academia, especially in the field of statistics, R is much more widely used than Python. In industry, the data science trend is slowly moving from R towards Python. One contributing factor is that companies using a Python-based application stack can more easily integrate a data scientist who writes Python code, since that eliminates a key hurdle in "productionizing" a data scientist's work.

Are you teaching "machine learning" or "statistical learning"?

The line between these two terms is blurry, but machine learning is concerned primarily with predictive accuracy over model interpretability, whereas statistical learning places a greater priority on interpretability and statistical inference. To some extent, R "assumes" that you are performing statistical learning and makes it easy to assess and diagnose your models. scikit-learn, by far the most popular machine learning package for Python, is more concerned with predictive accuracy. (For example, scikit-learn makes it very easy to tune and cross-validate your models and switch between different models, but makes it much harder than R to actually "examine" your models.) Thus, R is probably the better choice if you are teaching statistical learning, though Python also has a nice package for statistical modeling (Statsmodels) that duplicates some of R's functionality.

Do you care more about the ease with which students can get started in machine learning, or the ease with which they can go deeper into machine learning?

In R, getting started with your first model is easy: read your data into a data frame, use a built-in model (such as linear regression) along with R's easy-to-read formula language, and then review the model's summary output. In Python, it can be much more of a challenging process to get started simply because there are so many choices to make: How should I read in my data? Which data structure should I store it in? Which machine learning package should I use? What type of objects does that package allow as input? What shape should those objects be in? How do I include categorical variables? How do I access the model's output? (Et cetera.) Because Python is a general purpose programming language whereas R specializes in a smaller subset of statistically-oriented tasks, those tasks tend to be easier to do (at least initially) in R.
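For concreteness, the R path described above is roughly this (the file and column names are hypothetical):

dat <- read.csv("mydata.csv")        # read the data into a data frame
fit <- lm(y ~ x1 + x2, data = dat)   # built-in model plus the formula language
summary(fit)                         # review the model's summary output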

However, once you have mastered the basics of machine learning in Python (using scikit-learn), I find that machine learning is actually a lot easier in Python than in R. scikit-learn provides a clean and consistent interface to tons of different models. It provides you with many options for each model, but also chooses sensible defaults. Its documentation is exceptional, and it helps you to understand the models as well as how to use them properly. It is also actively being developed.

In R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be under active development. (caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, but it's nowhere near as elegant a solution as scikit-learn.) In summary, machine learning in R tends to be a more tiresome experience than machine learning in Python once you have moved beyond the basics. As such, Python may be a better choice if students are planning to go deeper into machine learning.

Do your students care about learning a "sexy" language?

R is not a sexy language. It feels old, and its website looks like it was created around the time the web was invented. Python is the "new kid" on the data science block, and has far more sex appeal. From a marketing perspective, Python may be the better choice simply because it will attract more students.

How computer savvy are your students?

Installing R is a simple process, and installing RStudio (the de facto IDE for R) is just as easy. Installing new packages or upgrading existing packages from CRAN (R's package management system) is a trivial process within RStudio, and even installing packages hosted on GitHub is a simple process thanks to the devtools package.
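The two install routes mentioned above look like this (the package names are just examples):

install.packages("ggplot2")                # from CRAN, inside R or RStudio
# install.packages("devtools")             # once, to enable GitHub installs
devtools::install_github("rstudio/shiny")  # a package hosted on GitHub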

By comparison, Python itself may be easy to install, but installing individual Python packages can be much more challenging. In my classroom, we encourage students to use the Anaconda distribution of Python, which includes nearly every Python package we use in the course and has a package management system similar to CRAN. However, Anaconda installation and configuration problems are still common in my classroom, whereas these problems were much more rare when using R and RStudio. As such, R may be the better choice if your students are not computer savvy.

Is data cleaning a focus of your course?

Data cleaning (also known as "data munging") is the process of transforming your raw data into a more meaningful form. I find data cleaning to be easier in Python because of its rich set of data structures, as well as its far superior implementation of regular expressions (which are often necessary for cleaning text).

Is data exploration a focus of your course?

The pandas package in Python is an extremely powerful tool for data exploration, though its power and flexibility can also make it challenging to learn. R's dplyr is more limited in its capabilities than pandas (by design), though I find that its more focused approach makes it easier to figure out how to accomplish a given task. As well, dplyr's syntax is more readable and thus is easier for me to remember. Although it's not a clear differentiator, I would consider R a slightly easier environment for getting started in data exploration due to the ease of learning dplyr.
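As a small illustration of the dplyr style referred to above, using the built-in mtcars data:

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%                       # keep four-cylinder cars
  group_by(gear) %>%                         # one group per gear count
  summarise(mean_mpg = mean(mpg), n = n())   # readable, chainable verbs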

Is data visualization a focus of your course?

R's ggplot2 is an excellent package for data visualization. Once you understand its core principles (its "grammar of graphics"), it feels like the most natural way to build your plots, and it becomes easy to produce sophisticated and attractive plots. Matplotlib is the de facto standard for scientific plotting in Python, but I find it tedious both to learn and to use. Alternatives like Seaborn and pandas plotting still require you to know some Matplotlib, and the alternative that I find most promising (ggplot for Python) is still early in development. Therefore, I consider R the better choice for data visualization.

Is Natural Language Processing (NLP) part of your curriculum?

Python's Natural Language Toolkit (NLTK) is a mature, well-documented package for NLP. TextBlob is a simpler alternative, spaCy is a brand new alternative focused on performance, and scikit-learn also provides some supporting functionality for text-based feature extraction. In comparison, I find R's primary NLP framework (the tm package) to be significantly more limited and harder to use. Even if there are additional R packages that can fill in the gaps, there isn't one comprehensive package that you can use to get started, and thus Python is the better choice for teaching NLP.

If you are a data science educator, or even just a data scientist who uses R or Python, I'd love to hear from you in the comments! On which points above do you agree or disagree? What are some important factors that I have left out? What language do you teach in the classroom, and why?

I look forward to this conversation!

P.S. Want to hear about new Data School posts or video tutorials? Subscribe to my newsletter.
P.P.S. Want to Tweet about this post? Here's a Tweet you can RT.

To leave a comment for the author, please follow the link and comment on his blog: R - Data School.

Categories: Methodology Blogs

My "Top 5 R Functions"

Mon, 2015-02-02 10:56

(This article was first published on Minding the Brain, and kindly contributed to R-bloggers)

In preparation for an R Workgroup meeting, I started thinking about what would be my "Top 5 R Functions". I ruled out the functions for basic mechanics - save, load, mean, etc. - they're obviously critical, but every programming language has them, so there's nothing especially "R" about them. I also ruled out the fancy statistical analysis functions like (g)lmer -- most people (including me) start using R because they want to run those analyses so it seemed a little redundant. I started using R because I wanted to do growth curve analysis, so it seems like a weak endorsement to say that I like R because it can do growth curve analysis. No, I like R because it makes (many) somewhat complex data operations really, really easy. Understanding how to take advantage of these R functions is what transformed my view of R from purely functional (I need to do analysis X and R has functions for doing analysis X) to an all-purpose tool that allows me to do data processing, management, analysis, and visualization extremely quickly and easily. So, here are the 5 functions that did that for me (a short sketch using all five follows the list):

  1. subset() for making subsets of data (natch)
  2. merge() for combining data sets in a smart and easy way
  3. melt() for converting from wide to long data formats
  4. dcast() for converting from long to wide data formats, and for making summary tables
  5. ddply() for doing split-apply-combine operations, which covers a huge swath of the most tricky data operations 
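As a quick sketch of how the five fit together (using the built-in airquality data; melt() and dcast() come from reshape2, ddply() from plyr):

library(reshape2)
library(plyr)

aq   <- subset(airquality, Month %in% 5:6)                 # 1. subset rows
long <- melt(aq, id.vars = c("Month", "Day"))              # 3. wide -> long
wide <- dcast(long, Month + Day ~ variable)                # 4. long -> wide

means <- ddply(long, .(Month, variable), summarise,        # 5. split-apply-combine
               avg = mean(value, na.rm = TRUE))

ozone <- merge(wide, subset(means, variable == "Ozone"),   # 2. combine data sets
               by = "Month")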
For anyone interested, I posted my R Workgroup notes on how to use these functions on RPubs. Side note: after a little configuration, I found it super easy to write these using knitr, "knit" them into a webpage, and post that page on RPubs.
Conspicuously missing from the above list is ggplot, which I think deserves a special lifetime achievement award for how it has transformed how I think about data exploration and data visualization. I'm planning that for the next R Workgroup meeting.

To leave a comment for the author, please follow the link and comment on his blog: Minding the Brain.

Categories: Methodology Blogs

Embedding R-generated Interactive HTML pages in MS PowerPoint

Mon, 2015-02-02 09:18

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Richard Pugh – Commercial Director, UK. 

Usually when I create slide decks these days I use markdown and slidy.  However, I recently was asked to present using an existing Revolution Microsoft PowerPoint template.

Trouble is, I’ve been spoilt with the advantages of using a HTML-based presentation technology and I wanted to include some interactive web elements.  In particular, I wanted to use a motion chart generated with the fantastic googleVis package.  Of course, that presented an issue – how was I to include some interactive HTML elements in my PowerPoint deck?

The answer turned out to involve a PowerPoint plug-in called LiveWeb.  These were the steps I took:

 

  • Download LiveWeb from http://skp.mvps.org/liveweb.htm and install it.  This adds a couple of buttons onto the PowerPoint tool bar (for me, it appears in the “Insert” part of the ribbon)

 

 

 

  • Generate your web content.  In my version, that meant using googleVis to generate a web page (see the sketch after this list)
  • Use the LiveWeb plugin to point your slide to the web page
  • Click play and wave your hands like Hans Rosling while rapidly talking about your slide:
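As a rough sketch of the “generate your web content” step – not Richard’s actual code; the chart and output file name are placeholders, and Fruits is a demo data set shipped with googleVis:

library(googleVis)

# build a motion chart and write it out as a standalone web page
# that the LiveWeb plug-in can then point at
m <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
plot(m)                              # preview locally in the browser
print(m, file = "motionchart.html")  # save a self-contained HTML page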

 

Btw – this also works for other HTML content, such as Shiny apps.  Here’s one from the RStudio Shiny Example page …

 

So, if you want to use MS PowerPoint, it is still possible to include R-generated interactive HTML content using the above steps.

 

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.

Categories: Methodology Blogs

RcppStreams 0.1.0

Mon, 2015-02-02 08:37

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The new package RcppStreams arrived on CRAN on Saturday. RcppStreams brings the excellent Streamulus C++ template library for event stream processing to R.

Streamulus, written by Irit Katriel, uses very clever template meta-programming (via Boost Fusion) to implement an embedded domain-specific event language created specifically for event stream processing.

The package wraps the four existing examples by Irit and all her unit tests, and includes a slide deck from a workshop presentation. The framework is quite powerful, and it brings a very promising avenue for (efficient!) stream event processing to R.

The NEWS file entries follows below:

Changes in version 0.1.0 (2015-01-30)
  • First CRAN release

  • Contains all upstream examples and documentation from Streamulus

  • Added Rcpp Attributes wrapping and minimal manual pages

  • Added Travis CI unit testing

Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this initial release. More detailed information is on the RcppStreams page and of course on the Streamulus page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

Categories: Methodology Blogs

2015.2: Did the New England Patriots experience a decrease in fumbles starting in 2007?

Sun, 2015-02-01 17:14

(This article was first published on SAS and R, and kindly contributed to R-bloggers)

Here's a timely guest entry from Jeffrey Witmer (Oberlin College).

As the “Deflate Gate” saga was unfolding, Warren Sharp analyzed “touches per fumble” for NFL teams before and after 2006, when a rule was changed so that teams playing on the road could provide their own footballs (http://www.sharpfootballanalysis.com/blog/). Sharp noted that, for whatever reason, the Patriots went from being a typical team, as regards fumbling, to a team with a very low fumble rate. Rather than rely on the data that Sharp provides at his website, I chose to collect and analyze some data on my own. I took a random sample of 30 games played by New England and 30 other games. For each game, I recorded all rushing and passing plays (except for QB kneels), but excluded kicking plays (the NFL, rather than the teams, provides special footballs for those plays). (Data source: http://www.pro-football-reference.com/play-index/play_finder.cgi.) I also recorded the weather for the game. (Data source: http://www.wunderground.com/history/.) Once I had the data (in a file that I called AllBig, which can be downloaded from http://www.amherst.edu/~nhorton/AllBig.csv), I noted whether or not there was a fumble on each play, aided by the grep() command:
grep("Fumb", AllBig$Detail, ignore.case=TRUE)
I labeled each play as Late or not according to whether it happened after the rule change:
AllBig$Late <- ifelse(AllBig$Year > 2006, 1, 0)
Now for the analysis. My data set has 7558 plays including 145 fumbles (1.9%). I used the mosaic package and the tally() command to see how often teams other than the Patriots fumble:
require(mosaic)
tally(~Fumble+Late, data=filter(AllBig,Pats==0))

Late
Fumble 0 1
0 2588 2919
1 54 65
Then I asked for the data in proportion terms:
tally(Fumble~Late, data=filter(AllBig,Pats==0))
and got
Late
Fumble 0 1
0 0.9796 0.9782
1 0.0204 0.0218
For non-Pats there is a tiny increase in fumbles. This can be displayed graphically using a mosaicplot (though it's not a particularly compelling figure):

mosaicplot(Fumble~Late, data=filter(AllBig,Pats==0))

Repeating this for the Patriots shows a different picture:
tally(~Fumble+Late, data=filter(AllBig,Pats==1))
Late
Fumble 0 1
0 996 910
1 19 7


tally(Fumble~Late, data=filter(AllBig,Pats==1))
Late
Fumble 0 1
0 0.98128 0.99237
1 0.01872 0.00763
I fit a logistic regression model with the glm() command:

glm(Fumble~Late*Pats, family=binomial, data=AllBig)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.8697 0.1375 -28.14 <2e-16 ***
Late 0.0650 0.1861 0.35 0.727
Pats -0.0897 0.2693 -0.33 0.739
Late:Pats -0.9733 0.4819 -2.02 0.043 *
I wanted to control for any weather effect, so I coded the weather as Bad if it was raining or snowing and good if not. This led to a model that includes BadWeather and Temperature – which turn out not to make much of a difference:
AllBig$BadWeather <- ifelse(AllBig$Weather %in% c("drizzle","rain","snow"), 1, 0)

glm(formula = Fumble ~ BadWeather + Temp + Late * Pats,
family = binomial, data = AllBig)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.23344 0.43164 -9.81 <2e-16 ***
BadWeather 0.33259 0.29483 1.13 0.26
Temp 0.00512 0.00612 0.84 0.40
Late 0.08871 0.18750 0.47 0.64
Pats -0.14183 0.27536 -0.52 0.61
Late:Pats -0.91062 0.48481 -1.88 0.06 .
Because there was suspicion that something changed starting in 2007 I added a three-way interaction:
glm(formula = Fumble ~ BadWeather + Temp + IsAway * Late * Pats,
family = binomial, data = AllBig)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.51110 0.47707 -9.46 <2e-16 ***
BadWeather 0.34207 0.30013 1.14 0.254
Temp 0.00831 0.00653 1.27 0.203
IsAway 0.14791 0.27549 0.54 0.591
Late 0.13111 0.26411 0.50 0.620
Pats -0.80019 0.54360 -1.47 0.141
IsAway:Late -0.07348 0.37463 -0.20 0.845
IsAway:Pats 0.94335 0.63180 1.49 0.135
Late:Pats 0.51536 0.71379 0.72 0.470
IsAway:Late:Pats -3.14345 1.29480 -2.43 0.015 *
There is some evidence here that the Patriots fumble less than the rest of the NFL and that things changed in 2007. The p-values above are based on asymptotic normality, but there is a cleaner and easier way to think about the Patriots’ fumble rate. I wrote a short simulation that mimics something I do in my statistics classes, where I use a physical deck of cards to show what each step in the R simulation is doing.
#Simulation of deflategate data null hypothesis
Late = rep(1,72) #creates 72 late fumbles
Early = rep(0,73) #creates 73 early fumbles
alldata = append(Late,Early) #puts the two groups together
table(alldata) #check to see that we have what we want

cards =1:length(alldata) # creates 145 cards, one "ID number" per fumble

FumbleLate = NULL # initializes a vector to hold the results
for (i in 1:10000){# starts a loop that will be executed 10,000 times
cardsgroup1 = sample(cards,119, replace=FALSE) # takes a sample of 119 cards
cardsgroup2 = cards[-cardsgroup1] # puts the remaining cards in group 2
NEPats = (alldata[cardsgroup2]) #reads the values of the cards in group 2
FumbleLate[i] = sum(NEPats) # counts NEPats late fumbles (the only stat we need)
}

table(FumbleLate) #look at the results
hist(FumbleLate, breaks=seq(2.5,23.5)) #graph the results

sum(FumbleLate <= 7)/10000 # How rare is 7 (or fewer)? Answer: around 0.0086
Additional note: kudos to Steve Taylor for the following graphical depiction of the interaction.

An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.

To leave a comment for the author, please follow the link and comment on his blog: SAS and R.

Categories: Methodology Blogs

New release (0.7) of metricsgraphics htmlwidget — grids & rollovers

Sun, 2015-02-01 10:34

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

I’ve updated my metricsgraphics package to version 0.7. The core MetricsGraphics JavaScript library has been updated to version 2.1.0 (from 1.1.0). Two blog-worthy features since releasing version 0.5 are mjs_grid, which is a grid.arrange-like equivalent for metricsgraphics plots, and mjs_add_rollover, which lets you add your own custom rollover text to the plots.

The Grid

The grid.arrange (and arrangeGrob) functions from the gridExtra package come in handy when combining ggplot2 charts. I wanted a similar way to arrange independent or linked metricsgraphics charts, hence mjs_grid was born.

mjs_grid uses the tag functions in htmltools to arrange metricsgraphics plot objects into an HTML <table> structure. At present, only uniform tables are supported, but I’m working on making the grid feature more flexible (just like grid.arrange). The current functionality is pretty straightforward:

  • You build individual metricsgraphics plots;
  • Optionally combine them in a list;
  • Pass in the plots/lists into mjs_grid;
  • Tell mjs_grid how many rows & columns are in the grid; and
  • Specify the column widths

But, code > words, so here are some examples. To avoid code repetition, note that you’ll need the following packages available to run most of the snippets below:

library(metricsgraphics)
library(htmlwidgets)
library(htmltools)
library(dplyr)

First, we’ll combine a few example plots:

tmp <- data.frame(year=seq(1790, 1970, 10), uspop=as.numeric(uspop))

tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line() %>%
  mjs_add_marker(1850, "Something Wonderful") %>%
  mjs_add_baseline(150, "Something Awful") -> mjs1

mjs_plot(rnorm(10000), width=300, height=300) %>%
  mjs_histogram(bins=30, bar_margin=1) -> mjs2

movies <- ggplot2::movies[sample(nrow(ggplot2::movies), 1000), ]
mjs_plot(movies$rating, width=300, height=300) %>%
  mjs_histogram() -> mjs3

tmp %>%
  mjs_plot(x=year, y=uspop, width=300, height=300) %>%
  mjs_line(area=TRUE) -> mjs4

mjs_grid(mjs1, mjs2, mjs3, mjs4, ncol=2, nrow=2)

Since your can pass a list as a parameter, you can generate many (similar) plots and then grid-display them without too much code. This one generates 7 random histograms with linked rollovers and displays them in grid. Note that this example has mjs_grid using the same algorithm grid.arrange does for auto-computing “optimal” grid size.

lapply(1:7, function(x) {
  mjs_plot(rnorm(10000, mean=x/2, sd=x), width=250, height=250, linked=TRUE) %>%
    mjs_histogram(bar_margin=2) %>%
    mjs_labs(x_label=sprintf("Plot %d", x))
}) -> plots

mjs_grid(plots)

And, you can use do from dplyr to get ggplot2 facet_-like behavior (though, one could argue that interactive graphics should use controls/selectors vs facets). This example uses the tips dataset from reshape2 and creates a list of plots that are then passed to mjs_grid:

tips <- reshape2::tips
a <- tips %>%
  mutate(percent=tip/total_bill,
         day=factor(day, levels=c("Thur", "Fri", "Sat", "Sun"), ordered=TRUE)) %>%
  group_by(day) %>%
  do(plot={
    day_label <- unique(.$day)
    mjs_plot(., x=total_bill, y=percent, width=275, height=275, left=100) %>%
      mjs_point(color_accessor=sex, color_type="category") %>%
      mjs_labs(x_label=sprintf("Total Bill (%s)", day_label), y_label="Tip %")
  })

mjs_grid(a$plot, ncol=2, nrow=2, widths=c(0.5, 0.5))

Rollovers

I’ve had a few requests to support the use of different rollovers and this is a first stab at exposing MetricsGraphics’ native functionality to users of the metricsgraphics package. The API changed from MG 1.1.0 to 2.2.0, so I’m kinda glad I waited for this. It requires knowledge of javascript, D3 and the use of {{ID}} as part of the CSS node selector when targeting the MetricsGraphics SVG element that displays the rollover text. Here is a crude, but illustrative example of how to take advantage of this feature (mouseover the graphics to see the altered text):

set.seed(1492)
dat <- data.frame(date=seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by="1 day"),
                  value=rnorm(n=31, mean=0, sd=2))

dat %>%
  mjs_plot(x=date, y=value, width=500, height=300) %>%
  mjs_line() %>%
  mjs_axis_x(xax_format = "date") %>%
  mjs_add_mouseover("function(d, i) {
                       $('{{ID}} svg .mg-active-datapoint')
                         .text('custom text : ' + d.date + ' ' + i);
                     }")

Postremo

If you are using metricsgraphics, drop a link in the comments here to show others how you’re using it! If you need/want some functionality (I’m hoping to get xts support into the 0.8 release) that isn’t already in existing feature requests or something’s broken for you, post a new issue on github.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

Categories: Methodology Blogs

Some 3D Graphics (rgl) for Classification with Splines and Logistic Regression (from The Elements of Statistical Learning)

Sun, 2015-02-01 10:01

(This article was first published on BioStatMatt » R, and kindly contributed to R-bloggers)

This semester I'm teaching from Hastie, Tibshirani, and Friedman's book, The Elements of Statistical Learning, 2nd Edition. The authors provide a Mixture Simulation data set that has two continuous predictors and a binary outcome. This data is used to demonstrate classification procedures by plotting classification boundaries in the two predictors. For example, the figure below is a reproduction of Figure 2.5 in the book:

The solid line represents the Bayes decision boundary (i.e., {x: Pr("orange"|x) = 0.5}), which is computed from the model used to simulate these data. The Bayes decision boundary and other boundaries are determined by one or more surfaces (e.g., Pr("orange"|x)), which are generally omitted from the graphics. In class, we decided to use the R package rgl to create a 3D representation of this surface. Below is the code and graphic (well, a 2D projection) associated with the Bayes decision boundary:

library(rgl)
load(url("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/ESL.mixture.rda"))
dat <- ESL.mixture

## create 3D graphic, rotate to view 2D x1/x2 projection
par3d(FOV=1, userMatrix=diag(4))
plot3d(dat$xnew[,1], dat$xnew[,2], dat$prob, type="n",
       xlab="x1", ylab="x2", zlab="", axes=FALSE, box=TRUE, aspect=1)

## plot points and bounding box
x1r <- range(dat$px1)
x2r <- range(dat$px2)
pts <- plot3d(dat$x[,1], dat$x[,2], 1, type="p", radius=0.5, add=TRUE,
              col=ifelse(dat$y, "orange", "blue"))
lns <- lines3d(x1r[c(1,2,2,1,1)], x2r[c(1,1,2,2,1)], 1)

## draw Bayes (True) decision boundary; provided by authors
dat$probm <- with(dat, matrix(prob, length(px1), length(px2)))
dat$cls <- with(dat, contourLines(px1, px2, probm, levels=0.5))
pls <- lapply(dat$cls, function(p) lines3d(p$x, p$y, z=1))

## plot marginal (w.r.t mixture) probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, dat$prob, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)

In the above graphic, the probability surface is represented in gray, and the Bayes decision boundary occurs where the plane f(x) = 0.5 (in light gray) intersects with the probability surface.

Of course, the classification task is to estimate a decision boundary given the data. Chapter 5 presents two multidimensional spline approaches, in conjunction with binary logistic regression, to estimate a decision boundary. The upper panel of Figure 5.11 in the book shows the decision boundary associated with additive natural cubic splines in x1 and x2 (4 df in each direction; 1+(4-1)+(4-1) = 7 parameters), and the lower panel shows the corresponding tensor product splines (4x4 = 16 parameters), which are much more flexible, of course. The code and graphics below reproduce the decision boundaries shown in Figure 5.11, and additionally illustrate the estimated probability surface (note: the code below should only be executed after the above code, since the 3D graphic is modified rather than created anew):

Reproducing Figure 5.11 (top):

## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)

## fit additive natural cubic spline model
library(splines)
ddat <- data.frame(y=dat$y, x1=dat$x[,1], x2=dat$x[,2])
form.add <- y ~ ns(x1, df=3) + ns(x2, df=3)
fit.add <- glm(form.add, data=ddat, family=binomial(link="logit"))

## compute probabilities, plot classification boundary
probs.add <- predict(fit.add, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.add <- with(dat, matrix(probs.add, length(px1), length(px2)))
dat$cls.add <- with(dat, contourLines(px1, px2, probm.add, levels=0.5))
pls <- lapply(dat$cls.add, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.add, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)

Reproducing Figure 5.11 (bottom):

## clear the surface, decision plane, and decision boundary
par3d(userMatrix=diag(4)); pop3d(id=sfc); pop3d(id=qds)
for(pl in pls) pop3d(id=pl)

## fit tensor product natural cubic spline model
form.tpr <- y ~ 0 + ns(x1, df=4, intercept=TRUE):ns(x2, df=4, intercept=TRUE)
fit.tpr <- glm(form.tpr, data=ddat, family=binomial(link="logit"))

## compute probabilities, plot classification boundary
probs.tpr <- predict(fit.tpr, type="response",
                     newdata=data.frame(x1=dat$xnew[,1], x2=dat$xnew[,2]))
dat$probm.tpr <- with(dat, matrix(probs.tpr, length(px1), length(px2)))
dat$cls.tpr <- with(dat, contourLines(px1, px2, probm.tpr, levels=0.5))
pls <- lapply(dat$cls.tpr, function(p) lines3d(p$x, p$y, z=1))

## plot probability surface and decision plane
sfc <- surface3d(dat$px1, dat$px2, probs.tpr, alpha=1.0,
                 color="gray", specular="gray")
qds <- quads3d(x1r[c(1,2,2,1)], x2r[c(1,1,2,2)], 0.5, alpha=0.4,
               color="gray", lit=FALSE)
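As a quick, optional sanity check on the parameter counts quoted above (assuming the two fits from the code blocks are still in the workspace):

## number of fitted coefficients in each model
length(coef(fit.add))  # 7  = 1 + (4-1) + (4-1), additive natural cubic splines
length(coef(fit.tpr))  # 16 = 4 x 4, tensor product splines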

Although the graphics above are static, it is possible to embed an interactive 3D version within a web page (e.g., see the rgl vignette; best viewed with Google Chrome) using the rgl function writeWebGL. I gave up on trying to embed such a graphic in this WordPress blog post, but I have created a separate page for the interactive 3D version of Figure 5.11b. Duncan Murdoch's work with this package is really nice!
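For reference, a minimal sketch of that export step (assuming an rgl version that still provides writeWebGL; newer releases favour rglwidget(), and the directory name below is arbitrary):

library(rgl)
## ... build the 3D scene as in the code above ...
writeWebGL(dir="webgl_fig5_11b", width=700)  # writes an HTML page containing the interactive scene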

To leave a comment for the author, please follow the link and comment on his blog: BioStatMatt » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R package “fishdynr”

Sun, 2015-02-01 09:20

(This article was first published on me nugget, and kindly contributed to R-bloggers)


The fishdynr package allows for the construction of some basic population dynamics models commonly used in fisheries science. Included are a single-cohort model, cohortSim, and a more complex iterative model that incorporates a stock-recruitment relationship, stockSim. The model functions take a list of parameters as their main argument, containing information about the given population's dynamics (growth, mortality, recruitment, etc.) and fishery (e.g. the selectivity function and related parameters). This allows a great deal of flexibility in adapting the analyses to a particular species or fishery, since functions for growth, mortality, fishing selectivity, recruitment, etc. can be defined outside the analysis. The package (located on GitHub) can be easily installed via the install_github function of the devtools package:

library(devtools)
install_github("fishdynr", "marchtaylor")

The above figure shows the output of Beverton and Holt's (1957) yield-per-recruit model (the ypr function, a wrapper for cohortSim), based on variable fishing mortality (F) and length at first capture (Lcap, knife-edge selection). Length at maturity is shown as a dashed white line for reference. In this example, the maximum yield (Ymax) is defined as the maximum possible yield without depleting the spawning biomass below 50% of its virgin, unfished level. This simple cohort model has many additional outputs that can be helpful in visualizing processes of growth and mortality:

I hope to continue to document different models used in fisheries science within the package. Any suggestions or comments are welcome.

References
Beverton, R. J. H.; Holt, S. J. (1957), On the Dynamics of Exploited Fish Populations, Fishery Investigations Series II Volume XIX, Ministry of Agriculture, Fisheries and Food

Script to reproduce the examples

#library(devtools)
#install_github("fishdynr", "marchtaylor")
 
library(fishdynr)
 
# Single cohort simulation
data(tilapia)
tilapia$N0 <- 1000
res <- cohortSim(tilapia, t_incr=0.1)
png("fishdynr_cohortSim.png", width=8, height=5, units="in", res=400, type="cairo")
op <- par(mfcol=c(2,3), mar=c(4,4,1,1), ps=10)
plot(Lt ~ t, res, t="l")
plot(Wt ~ t, res, t="l")
plot(pmat ~ t, res, t="l")
plot(pcap ~ t, res, t="l")
plot(Nt ~ t, res, t="l", log="y")
lines(Nt.noF ~ t, res, col=1, lty=2)
legend("topright", legend=c("Nt", "Nt; F=0"), col=1, lty=1:2)
plot(Bt ~ t, res, t="l")
lines(SBt ~ t, res, col=1, lty=2)
legend("topright", legend=c("Bt", "SBt"), col=1, lty=1:2)
par(op)
dev.off()
 
# Yield per recruit model
tilapia$N0 <- 1
n <- 30
adj.params <- list(F=seq(0,3,,n), knife_edge_size=seq(0,tilapia$Linf,,n))
res <- ypr(params=tilapia, adj.params)
SB_F0 <- res$SB[which(res$F==0),1]
res$relSB <- res$SB/SB_F0
Ymax <- which.max(res$Y*(res$relSB > 0.5))
 
pal <- colorRampPalette(c(
rgb(1,0.5,0.5), rgb(1,1,0.5), rgb(0.5,1,1), rgb(0.5,0.5,1)
))
 
png("fishdynr_ypr.png", width=8, height=4, units="in", res=400, type="cairo")
op <- par(mfcol=c(1,2), mar=c(3,3,2,1), oma=c(1,1,0,0), ps=10)
# Yield
image(x=res$F, y=res$knife_edge,
z=res$Y, col=pal(100),
xlab="", ylab=""
)
contour(x=res$F, y=res$knife_edge, z=res$Y, add=TRUE)
abline(h=tilapia$Lmat, col="white", lty=2)
points(res$adj.params.comb[Ymax,], lwd=2, col="white")
text(res$adj.params.comb[Ymax,], labels="Ymax", col="white", cex=1.2, pos=2, font=2)
mtext("Yield-Per-Recruit", side=3, line=0.5, cex=1.2, font=2)
# Relative spawning biomass
image(x=res$F, y=res$knife_edge,
z=res$SB/SB_F0, col=pal(100),
xlab="", ylab=""
)
contour(x=res$F, y=res$knife_edge, z=res$SB/SB_F0, add=TRUE)
abline(h=tilapia$Lmat, col="white", lty=2)
points(res$adj.params.comb[Ymax,], lwd=2, col="white")
text(res$adj.params.comb[Ymax,], labels="Ymax", col="white", cex=1.2, pos=2, font=2)
mtext("Rel. Spawning Biomass", line=0.5, side=3, cex=1.2, font=2)
mtext("Fishing mortality [F]", side=1, line=0, outer=TRUE)
mtext("Length at capture [Lcap]", side=2, line=0, outer=TRUE)
par(op)
dev.off()

To leave a comment for the author, please follow the link and comment on his blog: me nugget. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

KEGG enrichment analysis with latest online data using clusterProfiler

Sun, 2015-02-01 07:38

(This article was first published on YGC » R, and kindly contributed to R-bloggers)

KEGG.db has not been updated since 2012. The data is now pretty old, but many Bioconductor packages are still using it for KEGG annotation and enrichment analysis.

As pointed out in 'Are there too many biological databases', one problem is that many out-of-date biological databases never go offline. The same issue exists for web servers and software that use out-of-date data. For example, the WEGO web server stopped updating its GO annotation data in 2009, yet it is still online and widely used. The biological story may change completely when a recently updated dataset is used instead. We should keep an eye on this issue.

The enrichKEGG function now has a new parameter, use.KEGG.db. It is set to FALSE by default, in which case enrichKEGG downloads the latest KEGG data for enrichment analysis. If use.KEGG.db is explicitly set to TRUE, enrichKEGG falls back to KEGG.db, which is still supported but not recommended.
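As a minimal, illustrative sketch of the two modes (using the same example gene list as Example 1 below; the object names here are just for illustration):

library(DOSE)              # provides the example geneList
library(clusterProfiler)
data(geneList)
de <- names(geneList)[geneList > 1]

kk_online <- enrichKEGG(de, organism="hsa")                    # default: latest online KEGG data
kk_legacy <- enrichKEGG(de, organism="hsa", use.KEGG.db=TRUE)  # old KEGG.db; supported but not recommended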

With this new feature, the set of supported species is essentially unlimited: any organism with KEGG annotation available in the KEGG database can be analyzed. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html

The organism parameter in enrichKEGG should now be the KEGG abbreviation of the scientific name, for example 'hsa' for human and 'mmu' for mouse. It accepts any species listed at http://www.genome.jp/kegg/catalog/org_list.html.

In the current release version of clusterProfiler (Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accepts the common name of a species, for instance "human" or "mouse". For these previously supported species, the common name still works, so existing scripts will continue to run with the new version of clusterProfiler. For other species, common names are not supported: I don't want to maintain such a long mapping list, many species have no common name available, and it could also introduce unexpected bugs.

Example 1: Using online KEGG annotation

library(DOSE)
data(geneList)
de <- names(geneList)[geneList > 1]

library(clusterProfiler)
kk <- enrichKEGG(de, organism="hsa", pvalueCutoff=0.05, pAdjustMethod="BH",
                 qvalueCutoff=0.1, readable=TRUE)
head(summary(kk))

> head(summary(kk))
               ID                             Description GeneRatio  BgRatio
hsa04110 hsa04110                              Cell cycle    31/247 124/6861
hsa03030 hsa03030                         DNA replication     9/247  36/6861
hsa04060 hsa04060 Cytokine-cytokine receptor interaction    25/247 265/6861
hsa04114 hsa04114                          Oocyte meiosis    14/247 113/6861
hsa04115 hsa04115                   p53 signaling pathway    10/247  68/6861
hsa04062 hsa04062             Chemokine signaling pathway    18/247 189/6861
               pvalue     p.adjust       qvalue
hsa04110 2.280256e-18 9.349050e-17 5.280593e-17
hsa03030 3.527197e-06 7.230753e-05 4.084123e-05
hsa04060 8.404037e-06 1.148552e-04 6.487326e-05
hsa04114 4.827484e-05 4.948171e-04 2.794859e-04
hsa04115 1.406946e-04 9.801620e-04 5.536217e-04
hsa04062 1.434383e-04 9.801620e-04 5.536217e-04
         geneID
hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1/MCM2/CDC25A/CDC6/PLK1/BUB1B/ESPL1/CCNE1/ORC6/ORC1/CCNE2/MCM6/MCM4/DBF4/SKP2/CDC25B/BUB1/MYC/PCNA/E2F1/CDKN2A
hsa03030 MCM5/MCM2/MCM6/MCM4/FEN1/RFC4/PCNA/RNASEH2A/DNA2
hsa04060 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/IL1R2/CCL8/CXCL3/CCL20/IL12RB2/CXCL8/TNFRSF11A/CCL5/CXCR6/IL2RA/CCR1/CCL2/IL2RG/CCL4/CCR8/CCR7/GDF5/IL24/LTB/IL7
hsa04114 CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/PLK1/ESPL1/CCNE1/CCNE2/BUB1/FBXO5
hsa04115 CCNB2/RRM2/CDK1/CHEK1/CCNB1/GTSE1/CCNE1/CCNE2/SERPINB5/CDKN2A
hsa04062 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL3/CCL20/CXCL8/CCL5/CXCR6/CCR1/STAT1/CCL2/CCL4/HCK/CCR8/CCR7
         Count
hsa04110    31
hsa03030     9
hsa04060    25
hsa04114    14
hsa04115    10
hsa04062    18

In KEGG.db, only 5894 human genes are annotated. With the current online data, the number of annotated genes increases to 6861, as shown above, and of course the p-values change.

Users should also note that the readable parameter is only available for species that have an annotation database; for human, org.Hs.eg.db is used to map gene IDs to symbols.

Example 2: enrichment analysis of a species not previously supported

Here, I use a gene list of Streptococcus pneumoniae D39 to demonstrate using enrichKEGG with a species that was not previously supported.

> gene
 [1] "SPD_0065" "SPD_0071" "SPD_0293" "SPD_0295" "SPD_0296" "SPD_0297"
 [7] "SPD_0327" "SPD_0426" "SPD_0427" "SPD_0428" "SPD_0559" "SPD_0560"
[13] "SPD_0561" "SPD_0562" "SPD_0580" "SPD_0789" "SPD_1046" "SPD_1047"
[19] "SPD_1048" "SPD_1050" "SPD_1051" "SPD_1052" "SPD_1053" "SPD_1057"
[25] "SPD_1326" "SPD_1432" "SPD_1534" "SPD_1582" "SPD_1612" "SPD_1613"
[31] "SPD_1633" "SPD_1634" "SPD_1648" "SPD_1678" "SPD_1919"
> spdKEGG = enrichKEGG(gene, organism="spd")
> summary(spdKEGG)
               ID                                 Description GeneRatio BgRatio
spd00052 spd00052                        Galactose metabolism     35/35  35/752
spd02060 spd02060             Phosphotransferase system (PTS)     12/35  47/752
spd01100 spd01100                          Metabolic pathways     28/35 341/752
spd00520 spd00520 Amino sugar and nucleotide sugar metabolism      9/35  43/752
               pvalue     p.adjust       qvalue
spd00052 4.961477e-61 2.480739e-60 5.222608e-61
spd02060 2.470177e-07 6.175443e-07 1.300093e-07
spd01100 1.958319e-05 3.263866e-05 6.871296e-06
spd00520 6.534975e-05 8.168718e-05 1.719730e-05
         geneID
spd00052 SPD_0065/SPD_0071/SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0327/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1678/SPD_1919
spd02060 SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0426/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_1047/SPD_1048/SPD_1057
spd01100 SPD_0071/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
spd00520 SPD_0580/SPD_1326/SPD_1432/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
         Count
spd00052    35
spd02060    12
spd01100    28
spd00520     9

Summary

To summarize, clusterProfiler can download the latest KEGG annotation for enrichment analysis, and it supports all species that have KEGG annotation available in the KEGG database.

To install the devel version of clusterProfiler, start R and enter the following command:

install.packages(c("DOSE", "clusterProfiler"),
                 repos="http://www.bioconductor.org/packages/devel/bioc")

To leave a comment for the author, please follow the link and comment on his blog: YGC » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Unemployment

Sun, 2015-02-01 03:58

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

At the beginning of 2013 I made a post plotting unemployment in Europe, and last year I did the same. Now that the December unemployment numbers are on Eurostat, I am making the plots again. The plots show unemployment and its first derivative, both smoothed.
You can discuss these data extensively; Eurostat has a page for those who are interested. What I am personally interested in is a feeling for the trend: where are we going in the future? In many places things are improving, for which youth unemployment is a good indicator. However, a number of changes have occurred in the last month(s): cheap oil, a dropping Euro, and quantitative easing starting in the euro zone while ending in the USA. So I probably should not speculate.
Unemployment
First Derivative
 




Code
This code has been lightly adapted from last year's. The changes mostly had to do with a change in countries (the Euro Area has grown), and it seems that last year I managed to get all indicator variables in columns, while this year the values came in three age columns.
library(ggplot2)
library(KernSmooth)
library(plyr)
library(scales) # to access breaks/formatting functions

r1 <- read.csv("une_rt_m.csv",na.strings=':',check.names=FALSE)
names(r1)[2] <- 'GEO'
r1 <- reshape(r1,
    varying=list(names(r1)[3:5]),
    v.names='Value',
    timevar='AGE',
    idvar=names(r1)[1:2],
    times=names(r1)[3:5],
    direction='long')
rownames(r1) <- 1:nrow(r1)
levels(r1$GEO) <- sub(' countries)',')' ,levels(r1$GEO),fixed=TRUE)
levels(r1$GEO) <- sub('European Union','EU' ,levels(r1$GEO))
levels(r1$GEO)[levels(r1$GEO)=='Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)'] <- "Euro Area"
levels(r1$GEO)[levels(r1$GEO)=='United Kingdom'] <- 'UK'
levels(r1$GEO)[levels(r1$GEO)=='United States'] <- 'US'
levels(r1$GEO)[levels(r1$GEO)=='Germany (until 1990 former territory of the FRG)'] <- 'Germany'
levels(r1$GEO)
grep('12|13|15|16|17|18|25|27',x=levels(r1$GEO),value=TRUE)
r1 <- r1[!(r1$GEO %in% grep('12|13|15|16|17|18|25|27',x=levels(r1$GEO),value=TRUE)),]
r1$GEO <- factor(r1$GEO)
r1$Age <- factor(r1$AGE,levels=unique(r1$AGE))
r1$Date <- as.Date(paste(gsub('M','-',as.character(r1$TIME)),'-01',sep=''))

#
maxi <- aggregate(r1$Value,by=list(GEO=r1$GEO),FUN=max,na.rm=TRUE)
parts <- data.frame(
    low = maxi$GEO[maxi$x<quantile(maxi$x,1/3)]
    ,middle = maxi$GEO[maxi$x>quantile(maxi$x,1/3) & maxi$x<quantile(maxi$x,2/3)]
    ,high = maxi$GEO[maxi$x>quantile(maxi$x,2/3)]
)
#ggplot(r1[r1$GEO %in% low,],aes(x=Date,y=Value,colour=Age)) +
#        facet_wrap( ~ GEO, drop=TRUE) +
#        geom_line()  +
#        theme(legend.position = "bottom")
#        ylab('% Unemployment') + xlab('Year')


r1$class <- interaction(r1$GEO,r1$Age)
head(r1)
r3 <- r1[complete.cases(r1),]
r3$class <- factor(r3$class)
Perc <- ddply(.data=r3,.variables=.(class),
    function(piece,...) {
        lp <- locpoly(x=as.numeric(piece$Date),y=piece$Value,
            drv=0,bandwidth=90)
        sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
            sPerc=lp$y,Age=piece$Age[1],GEO=piece$GEO[1])}
    ,.inform=FALSE
)
for (i in c('low','middle','high')) {
    png(paste(i,'.png',sep=''))
    print(
        ggplot(Perc[Perc$GEO %in% parts[,i] ,],
                aes(x=Date,y=sPerc,colour=Age)) +
            facet_wrap( ~ GEO, drop=TRUE) +
            geom_line()  +
            theme(legend.position = "bottom")+
            ylab('% Unemployment') + xlab('Year') +
            scale_x_date(breaks = date_breaks("5 years"),
                labels = date_format("%y"))
    )
    dev.off()
}

dPerc <- ddply(.data=r3,.variables=.(class),
    function(piece,...) {
        lp <- locpoly(x=as.numeric(piece$Date),y=piece$Value,
            drv=1,bandwidth=365/2)
        sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
            dPerc=lp$y,Age=piece$Age[1],GEO=piece$GEO[1])}
    ,.inform=FALSE
)

for (i in c('low','middle','high')) {
    png(paste('d',i,'.png',sep=''))
    print(
        ggplot(dPerc[dPerc$GEO %in% parts[,i] ,],
                aes(x=Date,y=dPerc,colour=Age)) +
            facet_wrap( ~ GEO, drop=TRUE) +
            geom_line()  +
            theme(legend.position = "bottom")+
            ylab('Change in % Unemployment') + xlab('Year')+
            scale_x_date(breaks = date_breaks("5 years"),
                labels = date_format("%y"))
    )
    dev.off()
}

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

The leaflet package for online mapping in R

Sat, 2015-01-31 19:00

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

It has been possible for some years to launch a web map from within R. A number of packages for doing this are available, including:

  • RgoogleMaps, an interface to the Google Maps api
  • leafletR, an early package for creating Leaflet maps with R
  • rCharts, which can be used as a basis for webmaps

In this tutorial we use the new RStudio-supported leaflet R package. We use this package, an R interface to the JavaScript mapping library of the same name, because:

  • leaflet is supported by RStudio, who have a strong track record of creating amazing R packages
  • leaflet appears to provide the simplest, fastest way to host interactive maps online in R, requiring only 2 lines of code for one web map! (as you’ll see below)
  • leaflet is shiny. Shiny in the literal sense of the word (a new and fresh approach to web mapping in R) but also in the sense that it works well with the R package shiny.

The best tutorial resource I have found on leaflet is this vignette by Joe Cheng and Yihui Xie. Building on their excellent description, this article explains some of the fundamentals of the package.

Installing leaflet

Because leaflet is new, it’s not yet on CRAN. Even when it does appear, installing from github may be a good idea, to ensure you have access to the latest features and bug fixes. Here’s how:

# Install the leaflet package (requires devtools for install_github)
if (!require(leaflet)) devtools::install_github("rstudio/leaflet")

A first web map with leaflet

To create an interactive web map with leaflet is incredibly easy. Having installed the package try this single line of code:

# Plot a default web map (brackets display the result)
(m <- leaflet() %>% addTiles())

img <- readPNG("~/repos/Creating-maps-in-R/figure//shiny_world.png")
grid.raster(img)

Adding basic features with %>%

Adding basic features to your web map is easy. The %>% ‘pipe’ operator, used extensively in dplyr and developed for the magrittr package, means we can finally escape from dozens of nested bracketed commands. (If you use RStudio, I suggest trying the new shortcut Ctrl+Shift+M to produce this wonderful operator.) leaflet settings and functionality can thus be added sequentially, without requiring any additional brackets!
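To see what this buys us, compare a nested call with its piped equivalent (a small illustrative sketch):

# nested function calls:
setView(addTiles(leaflet()), lng = -1.5, lat = 53.4, zoom = 10)
# the same thing with %>%:
leaflet() %>% addTiles() %>% setView(lng = -1.5, lat = 53.4, zoom = 10)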

To add a location to the map m, for example, we can simply pipe m into the function setView():

m %>% setView(lng = -1.5, lat = 53.4, zoom = 10) # set centre and extent of map

This way we can gradually add elements to our map, one-by-one:

(m2 <- m %>%
  setView(-1.5, 53.4, 10) %>%                            # map location
  addMarkers(-1.4, 53.5) %>%                             # add a marker
  addPopups(-1.6, 53.3, popup = "Hello Sheffield!") %>%  # add a popup
  # add some circles:
  addCircles(color = "black", runif(90, -2, -1), runif(90, 53, 54), runif(90, 10, 500)))

Adding data

In the previous example, we added some random data that we created locally, inside the function call. How do we use real, large datasets in leaflet? The package provides three options (a short sketch of the first follows the list):

  • Data from base R: lat/long matrix or data.frame
  • Data from sp such as SpatialPoints and SpatialPolygons
  • Data from maps
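As a minimal sketch of the first option, a plain data.frame of coordinates (made-up points, purely illustrative) can be passed straight to leaflet():

pts <- data.frame(lat = runif(10, 53.3, 53.5),
                  lng = runif(10, -1.6, -1.4))   # random points, roughly around Sheffield
leaflet(data = pts) %>%
  addTiles() %>%
  addCircles(lng = ~lng, lat = ~lat, radius = 100)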

Let’s try adding a bicycle route, one that I followed with some friends to move from Sheffield to my current home of Leeds. First download some data:

url = "https://github.com/Robinlovelace/sdvwR/raw/master/data/gps-trace.gpx" download.file(url, destfile = "shef2leeds.gpx", method = "wget")

Now we can load this as a SpatialLinesDataFrame and display in leaflet:

library(rgdal)
shef2leeds <- readOGR("shef2leeds.gpx", layer = "tracks")
m2 %>%
  setView(-1.5, 53.4, 9) %>%  # map location
  addPolylines(data = shef2leeds, color = "red", weight = 4)

Note in the above example that we had to use the argument data = to refer to our spatial object: it cannot simply be inserted without specifying what it is. The data argument can also be placed inside the initial leaflet() function call.
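For example, the previous map could equivalently be written with the data supplied up front (a small sketch):

leaflet(data = shef2leeds) %>%
  addTiles() %>%
  setView(-1.5, 53.4, 9) %>%
  addPolylines(color = "red", weight = 4)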

That was quite a painless process that would take many more lines of code if you were to do it directly in JavaScript. But not as painless as the bicycle trip itself, which involved even fewer lines of code: 0! This can be seen in the following video.

Shiny integration

leaflet is developed by the same team who develop shiny so the two are well integrated. Below is a very simple example, modified slightly from the package’s vignette:

library(shiny)
shinyApp(
  ui = fluidPage(leafletOutput('myMap')),
  server = function(input, output) {
    # download and load data
    url = "https://github.com/Robinlovelace/sdvwR/raw/master/data/gps-trace.gpx"
    download.file(url, destfile = "shef2leeds.gpx", method = "wget")
    library(rgdal)
    shef2leeds <- readOGR("shef2leeds.gpx", layer = "tracks")
    map = leaflet() %>%
      addTiles() %>%
      setView(-1.5, 53.4, 9) %>%
      addPolylines(data = shef2leeds, color = "red", weight = 4)
    output$myMap = renderLeaflet(map)
  }
)

Applications

Clearly leaflet is a powerful and flexible R package. If I were to offer one critique, it would be that I could find no easy way to allow the user to query the loaded data objects. plotly, for example, provides a description of any visual object the user clicks on. The datashine commuter visualisation, likewise, allows users to click on any point, resulting in a burst of lines emanating from it. This would also be possible in leaflet/shiny, but the best implementation is not immediately clear, to me at least!

The wider context of this article is the pressing need for better transport planning decision making, to enable a transition away from fossil fuels. To this end, the ‘propensity to cycle’ project, funded by the UK’s Department for Transport, seeks to create an interactive tool to identify where new bicycle paths are most needed. There are clearly many other uses for R’s leaflet package: what will you use it for? Let me know at @robinlovelace.

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Bio7 2.0 for Linux Released!

Sat, 2015-01-31 17:05

(This article was first published on » R, and kindly contributed to R-bloggers)

01.02.2015

I released the Linux version of Bio7 2.0 (see screenshot below).

For an overview of the new Bio7 features follow the link: Bio7 2.0 new features

Added Linux features:

  • A new Java menu action (in the Scripts menu) is available to start a Py4J server that lets Java and CPython communicate (if enabled in the native preferences of Bio7, and if Py4J is installed in the local Python distribution). With this feature it is possible, e.g., to transfer ImageJ values to CPython or to call the Bio7 Java API.
  • Improved the embedded native pseudo terminal to send signals to a process tree (e.g., SIGINT = STRG+C).
  • Added an option in the native preferences to interpret a Python script with a native CPython interpreter (>= 3.0).

Installation:

Download Bio7 and simply unzip the Bio7 archive file in your preferred location. Bio7 comes bundled with a JRE so you don’t need to install Java separately. The Bio7 application was tested on Ubuntu 14.04 (minimum requirement).

R features:

For the Linux version of Bio7 2.0, R and Rserve have to be installed. For use in Bio7, Rserve has to be compiled with a special flag that enables cooperative mode (see below).

In a shell simply execute:

sudo PKG_CPPFLAGS=-DCOOPERATIVE R CMD INSTALL Rserve_1.8-1.tar.gz

The command compiles and installs the Rserve package in your default Linux R installation. This is necessary to share the R workspace when you switch from a local Rserve connection to the native Bio7 R console (see this video for an explanation of the new connection mode in Bio7).

Please note that the default R package location of Bio7 is: /usr/lib/R/site-library

The location (and the path to R) can be changed in the R preferences of Bio7 (e.g., menu: R->Preferences).

To leave a comment for the author, please follow the link and comment on his blog: » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs