R bloggers

R news and tutorials contributed by (573) R bloggers

Scraping jQuery DataTable Programmatic JSON with R

Mon, 2015-05-18 07:35

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

School of Data had a recent post on how to copy “every item” from a multi-page list. While their post did provide a neat hack, their “words of warning” are definitely missing some items, and the overall methodology can be improved upon with some basic R scripting.

First, the technique they outlined relies heavily on how parameters are passed and handled by the server the form is connected to. The manual technique is not guaranteed to work across all types of forms nor even those with a “count” popup. I can see this potentially frustrating many budding data janitors.

Second, this particular technique and example really centers around jQuery DataTables. While their display style can be highly customized, it’s usually pretty easy to determine if they are being used both visually:

(i.e. by the controls & style of the controls available) and in the source:

The URLs might be local or on a common content delivery network, but it should be pretty easy to determine when a jQuery DataTable is in use. Once you do, you should also be able to tell if it’s calling out to a URL for some JSON to populate the structure.

Here, I just used Chrome’s Developer Tools to look at the responses coming back from the server. That’s a pretty ugly GET request, but we can see the query parameters a bit better if we scroll down:

These definitely track well with the jQuery DataTable server-side documentation so we should be able to use this to our advantage to avoid the pitfalls of overwhelming the browser with HTML entities and doing cut & paste to save out the list.
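To make that parameter structure easier to see, here is a minimal sketch of how a few of the documented server-side parameters (draw, start, length, search[value]) could be supplied through httr's query argument rather than hand-building the URL. Whether this endpoint accepts such a reduced parameter set is an assumption; the full request actually used in this post is shown in the next section.

library(httr)

# hypothetical reduced request; the real call below spells out every column definition
base_url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php"
resp <- GET(base_url, query = list(
  draw = 1,
  start = 0,              # index of the first record to return
  length = 25,            # number of records per "page"
  `search[value]` = "",   # global search box, left empty
  movies = "true", shows = "true", documentaries = "true",
  rating = "netflix"
))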

Getting the Data With R

The R code to get this same data is about as simple as it gets. All you need is the data source URL, with a modified length query parameter. After that, it’s just a few lines of code:

library(httr)
library(jsonlite)
library(dplyr) # for glimpse

url <- "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=1&columns%5B0%5D%5Bdata%5D=box_art&columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=title&columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=year&columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=rating&columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=category&columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=true&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=available&columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=true&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=director&columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=true&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B7%5D%5Bdata%5D=cast&columns%5B7%5D%5Bname%5D=&columns%5B7%5D%5Bsearchable%5D=true&columns%5B7%5D%5Borderable%5D=true&columns%5B7%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B7%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=5&order%5B0%5D%5Bdir%5D=desc&start=0&length=7448&search%5Bvalue%5D=&search%5Bregex%5D=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1431945465056"

resp <- GET(url)

Normally we would be able to do:

content(resp, as="parsed")

but this server did not set the Content-Type of the response well, so we have to do it by hand with the jsonlite package:

recs <- fromJSON(content(resp, as="text"))

The recs variable is now an R list with a structure that (thankfully) fully represents the expected server response:

## List of 4
##  $ draw           : int 1
##  $ recordsTotal   : int 7448
##  $ recordsFiltered: int 7448
##  $ data           :'data.frame': 7448 obs. of 9 variables:
##   ..$ box_art  : chr [1:7448] "<img src="http://cdn1.nflximg.net/images/9159/12119159.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/6195/20866195.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn1.nflximg.net/images/3735/2243735.jpg" width="55" alt="Thumbnail">" "<img src="http://cdn0.nflximg.net/images/2668/21112668.jpg" width="55" alt="Thumbnail">" ...
##   ..$ title    : chr [1:7448] "In the Bedroom" "Wolfy: The Incredible Secret" "Bratz: Diamondz" "Tinker Bell and the Legend of the NeverBeast" ...
##   ..$ year     : chr [1:7448] "2001" "2013" "2006" "2015" ...
##   ..$ rating   : chr [1:7448] "3.3" "2.5" "3.6" "4" ...
##   ..$ category : chr [1:7448] "<a href="http://www.allflicks.net/category/thrillers/">Thrillers</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" "<a href="http://www.allflicks.net/category/children-and-family-movies/">Children & Family Movies</a>" ...
##   ..$ available: chr [1:7448] "17 May 2015" "17 May 2015" "17 May 2015" "17 May 2015" ...
##   ..$ cast     : chr [1:7448] "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa Tomei, William Mapother, William Wise, Celia Weston, Karen Allen, Frank T. Well"| __truncated__ "Rafael Marin, Christian Vandepas, Gerald Owens, Yamile Vasquez, Pilar Uribe, James Carrey, Rebecca Jimenez, Joshua Jean-Baptist"| __truncated__ "Olivia Hack, Soleil Moon Frye, Tia Mowry-Hardrict, Dionne Quan, Wendie Malick, Lacey Chabert, Kaley Cuoco, Charles Adler" "Ginnifer Goodwin, Mae Whitman, Rosario Dawson, Lucy Liu, Pamela Adlon, Raven-Symoné, Megan Hilty" ...
##   ..$ director : chr [1:7448] "Todd Field" "Éric Omond" "Mucci Fassett, Nico Rijgersberg" "Steve Loter" ...
##   ..$ id       : chr [1:7448] "60022258" "70302834" "70053695" "80028529" ...

We see there is a data.frame in there with the expected # of records. We can also use glimpse from dplyr to see the data table a bit better:

glimpse(recs$data)

## Observations: 7448
## Variables:
## $ box_art   (chr) "<img src="http://cdn1.nflximg.net/images/9159/12...
## $ title     (chr) "In the Bedroom", "Wolfy: The Incredible Secret", ...
## $ year      (chr) "2001", "2013", "2006", "2015", "1993", "2013", "2...
## $ rating    (chr) "3.3", "2.5", "3.6", "4", "3.5", "3.1", "3.3", "4....
## $ category  (chr) "<a href="http://www.allflicks.net/category/thril...
## $ available (chr) "17 May 2015", "17 May 2015", "17 May 2015", "17 M...
## $ cast      (chr) "Tom Wilkinson, Sissy Spacek, Nick Stahl, Marisa T...
## $ director  (chr) "Todd Field", "Éric Omond", "Mucci Fassett, Nico R...
## $ id        (chr) "60022258", "70302834", "70053695", "80028529", "8...

Now we can use that in any R workflow, or write it out as a CSV (or other format) for other workflows to use. No browsers were crashed, and we have code we can run again to scrape the site (i.e. when they add more movies to the database) versus a manual cut & paste workflow.
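As a quick sketch of that last step (the HTML-stripping regex and the output file name are illustrative, not from the original post):

movies <- recs$data
movies$box_art <- NULL                                    # drop the thumbnail <img> markup
movies$category <- gsub("<[^>]+>", "", movies$category)   # keep only the category text
write.csv(movies, "netflix_catalog.csv", row.names = FALSE)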

Many of the concepts in this post can be applied to other data table displays (i.e. those not based on jQuery DataTable), but you’ll have to get comfortable with the developer tools view of your favorite browser.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

Categories: Methodology Blogs

Data Science Tweet Analysis – What tools are people talking about?

Mon, 2015-05-18 07:12

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Chris Musselle PhD, Mango UK

At Mango we use a variety of tools in-house to address our clients’ business needs and when these fall within the data science arena, the main candidates we turn to are either the R or Python programming languages.

The question as to which is the “best” language for doing data science is a hotly debated topic ([link] [link] [link] [link]), with both languages having their pros and cons. However the capabilities of each are expanding all the time thanks to continuous open source development in both areas.

With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.

This post is the first of three that will look into the results of our analysis, but first a bit of background.

 

Twitter Analysis

Today many companies are routinely drawing on social media data sources such as Twitter and Facebook to enhance their business decision making in a number of ways. This type of analysis can be a component of market research, an avenue for collecting customer feedback or a way to promote campaigns and conduct targeted advertising.

To facilitate this type of analysis, Twitter offer a variety of Application Programming Interfaces or APIs that enable an application to programmatically interact with the services provided by Twitter. These APIs currently come in three main flavours.

  • REST API – Allows automated access to searching, reading and writing tweets
  • Streaming API – Allows tracking of multiple users and or search terms in near real time, though results may only be a sample
  • Twitter Firehose – Allows tracking of all tweets past and future, no limits on search results returned.

These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees a sample of all current tweets will be collected [link]. This means that if your search term is very generic and matches a lot of tweets, then not all of these tweets will be returned [link].

The Twitter Firehose addresses the shortcomings of the previous two APIs, but at quite a substantial cost, whereas the other two are free to use. There are also a growing number of third party intermediaries that have access to the Twitter Firehose, and sell on the Twitter data they collect [link].

 

Our Approach

We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10 day period.

To harvest the data, a Python script was created to utilize the API and append tweets to a single file. Command line tools such as csvkit and jq were then used to clean and preprocess the data, with the analysis done in Python using the pandas library.
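The collection itself was done in Python, but for readers who prefer to stay in R, a roughly equivalent step could be sketched with the streamR package (the hashtags and file name follow the description above; the OAuth object my_oauth is assumed to have been created beforehand):

library(streamR)

# my_oauth: an OAuth credential set up earlier via ROAuth (not shown here)
filterStream(file.name = "tweets.json",
             track = c("#python", "#rstats", "#datascience"),
             timeout = 3600,   # collect for an hour at a time; repeat over the 10 days
             oauth = my_oauth)

# parse the raw JSON into a data frame for further processing
tweets_df <- parseTweets("tweets.json")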

 

Preliminary Results: Hashtag Counts and Co-occurrence

From Figure 1, it is immediately obvious that “python” and “datascience” were more popular hashtags than “rstats” over the time period sampled. Though interestingly, there was little overlap between these groups.

Figure 1: Venn diagram of tweet counts by hashtag

 

This suggests that the majority of tweets that mentioned these subjects either did so in isolation or alongside other hashtags that were not tracked. We can get a sense of which is the case by looking at a count of the total number of unique hashtags that occurred alongside each tracked hashtag; this is shown in Table 1.

Table 1: Total unique hashtags used per tracked subset

 

These counts show that the “python” hashtag is mentioned alongside a lot more other topics/hashtags than “rstats” and “datascience”. This makes sense when you consider that Python is a general purpose programming language, and as such has a broader range across application domains than R, which is more statistically focused. In between these is the “datascience” hashtag, a term that relates to many different skillsets and technologies, and so we would expect the number of unique hashtag co-occurrences to be quite high.

 

So what are people mentioning alongside these hashtags if not these technologies?

Table 2 shows the top hashtags mentioned alongside the three tracked hashtags. Here the numbers in the header are the total number of tweets that contained the tracked hashtag plus at least one other hashtag. As can be seen, all three subjects were commonly mentioned alongside other hashtags, so the vast majority of tweets occur with multiple hashtags.

Table 2: Table of most frequent co-occurring hashtags with tracked keywords. Numbers in the header are the total number of tweets containing at least one other hashtag to the one tracked.

As we might expect, many co-occurring hashtags are closely related, though it is interesting to see that “datascience” frequently co-occurs with more general concepts and ‘buzzwords’, with specific technologies mentioned further down the list.

Python on the other hand occurs frequently alongside other web technologies, as well as “careers” and “hiring”, which may reflect a high demand for jobs that use Python and these related technologies for web development. On the other hand it may simply be that many good web developers are active on Twitter, and as such recruitment companies favor this medium of advertising when trying to fill web development positions.

It’s interesting that tweets with the “rstats” hashtag mentioned “datascience” and “bigdata” more than any others, likely reflecting the increasing trend of using R in this arena. The other co-occurring hashtags for R can be grouped into: those that relate to its domain-specific use (“statistics”, “analytics”, “machinelearning” etc.); possible ways of integrating it with other languages (“python”, “excel”, “d3js”); and other ways of referencing R itself (“r”, “rlang”)!

 

Summary

So from looking at the counts of hashtags and their co-occurrences, it looks like:

  • Tweets containing Python or data science were roughly 5 times more frequent than those containing Rstats. There was also little relative overlap in the three hashtags tracked.
  • Tweets containing Python also mention a broader range of other topics, while R is more focused around data science, statistics and analytics.
  • Tweets mentioning data science most commonly include hashtags for general analytics concepts and ‘buzzwords’, with specific technologies only occasionally mentioned.
  • Tweets mentioning Python most commonly include hashtags for web development technologies and are likely the result of a high volume of recruitment advertising.

 

Future Work

So far we have only looked at the hashtag contents of the tweet and there is much more data contained within that can be analysed. Two other key components are the user mentions and the URLs in the message. Future posts will look into both of these to investigate the content being shared, along with who is retweeting/being retweeted by whom.

 

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.

Categories: Methodology Blogs

Introductory Time-Series analysis of US Environmental Protection Agency (EPA) pollution data

Mon, 2015-05-18 07:05

(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

Download EPA air pollution data
The US Environmental Protection Agency (EPA) provides tons of free data about air pollution and other weather measurements through their website. An overview of their offer is available here: http://www.epa.gov/airdata/

The data are provided in hourly, daily and annual averages for the following parameters:
Ozone, SO2, CO, NO2, PM2.5 FRM/FEM Mass, PM2.5 non-FRM/FEM Mass, PM10, Wind, Temperature, Barometric Pressure, RH and Dewpoint, HAPs (Hazardous Air Pollutants), VOCs (Volatile Organic Compounds) and Lead.

All the files are accessible from this page:
http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html

The web links to download the zip files are very similar to each other: they all start with the URL http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/
and then the name of the file has the following format: type_property_year.zip
The type can be: hourly, daily or annual. The properties are sometimes written as text and sometimes using a numeric ID. Everything is separated by an underscore.

Since these files are identified by consistent URLs, I created a function in R that takes year, property and type as arguments, downloads and unzips the data (in the working directory) and reads the csv.
To complete this experiment we would need the following packages: sp, raster, xts, plotGoogleMaps
The code for this function is the following:

download.EPA <- function(year, property = c("ozone","so2","co","no2","pm25.frm","pm25","pm10","wind","temp","pressure","dewpoint","hap","voc","lead"), type=c("hourly","daily","annual")){
if(property=="ozone"){PROP="44201"}
if(property=="so2"){PROP="42401"}
if(property=="co"){PROP="42101"}
if(property=="no2"){PROP="42602"}
 
if(property=="pm25.frm"){PROP="88101"}
if(property=="pm25"){PROP="88502"}
if(property=="pm10"){PROP="81102"}
 
if(property=="wind"){PROP="WIND"}
if(property=="temp"){PROP="TEMP"}
if(property=="pressure"){PROP="PRESS"}
if(property=="dewpoint"){PROP="RH_DP"}
if(property=="hap"){PROP="HAPS"}
if(property=="voc"){PROP="VOCS"}
if(property=="lead"){PROP="lead"}
 
URL <- paste0("http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/",type,"_",PROP,"_",year,".zip")
download.file(URL,destfile=paste0(type,"_",PROP,"_",year,".zip"))
unzip(paste0(type,"_",PROP,"_",year,".zip"),exdir=paste0(getwd()))
read.table(paste0(type,"_",PROP,"_",year,".csv"),sep=",",header=T)
}
This function can be used as follows to create a data.frame with exactly the data we are looking for:

data <- download.EPA(year=2013,property="ozone",type="daily")
This creates a data.frame object with the following characteristics:

> str(data)
'data.frame': 390491 obs. of 28 variables:
$ State.Code : int 1 1 1 1 1 1 1 1 1 1 ...
$ County.Code : int 3 3 3 3 3 3 3 3 3 3 ...
$ Site.Num : int 10 10 10 10 10 10 10 10 10 10 ...
$ Parameter.Code : int 44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
$ POC : int 1 1 1 1 1 1 1 1 1 1 ...
$ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
$ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
$ Datum : Factor w/ 4 levels "NAD27","NAD83",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Parameter.Name : Factor w/ 1 level "Ozone": 1 1 1 1 1 1 1 1 1 1 ...
$ Sample.Duration : Factor w/ 1 level "8-HR RUN AVG BEGIN HOUR": 1 1 1 1 1 1 1 1 1 1 ...
$ Pollutant.Standard : Factor w/ 1 level "Ozone 8-Hour 2008": 1 1 1 1 1 1 1 1 1 1 ...
$ Date.Local : Factor w/ 365 levels "2013-01-01","2013-01-02",..: 59 60 61 62 63 64 65 66 67 68 ...
$ Units.of.Measure : Factor w/ 1 level "Parts per million": 1 1 1 1 1 1 1 1 1 1 ...
$ Event.Type : Factor w/ 3 levels "Excluded","Included",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Observation.Count : int 1 24 24 24 24 24 24 24 24 24 ...
$ Observation.Percent: num 4 100 100 100 100 100 100 100 100 100 ...
$ Arithmetic.Mean : num 0.03 0.0364 0.0344 0.0288 0.0345 ...
$ X1st.Max.Value : num 0.03 0.044 0.036 0.042 0.045 0.045 0.045 0.048 0.057 0.059 ...
$ X1st.Max.Hour : int 23 10 18 10 9 10 11 12 12 10 ...
$ AQI : int 25 37 31 36 38 38 38 41 48 50 ...
$ Method.Name : Factor w/ 1 level " - ": 1 1 1 1 1 1 1 1 1 1 ...
$ Local.Site.Name : Factor w/ 1182 levels ""," 201 CLINTON ROAD, JACKSON",..: 353 353 353 353 353 353 353 353 353 353 ...
$ Address : Factor w/ 1313 levels " Edgewood Chemical Biological Center (APG), Waehli Road",..: 907 907 907 907 907 907 907 907 907 907 ...
$ State.Name : Factor w/ 53 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
$ County.Name : Factor w/ 631 levels "Abbeville","Ada",..: 32 32 32 32 32 32 32 32 32 32 ...
$ City.Name : Factor w/ 735 levels "Adams","Air Force Academy",..: 221 221 221 221 221 221 221 221 221 221 ...
$ CBSA.Name : Factor w/ 414 levels "","Adrian, MI",..: 94 94 94 94 94 94 94 94 94 94 ...
$ Date.of.Last.Change: Factor w/ 169 levels "2013-05-17","2013-07-01",..: 125 125 125 125 125 125 125 125 125 125 ...
The csv file contains a long series of columns that should again be consistent among all the datasets cited above, even though they change slightly between the hourly, daily and annual averages.
A complete list of the meaning of all the columns is available here:
aqsdr1.epa.gov/aqsweb/aqstmp/airdata/FileFormats.html

Some of the columns are self explanatory, such as the various geographical names associated with the location of the measuring stations. For this analysis we are particularly interested in the address (that we can use to extract data from individual stations), event type (that tells us if extreme weather events are part of the averages), the date and the actual data (available in the column Arithmetic.Mean).


Extracting data for individual stations
The data.frame we loaded using the function download.EPA contains Ozone measurements from all over the country. To perform any kind of analysis we first need a way to identify and then subset the stations we are interested in.
For doing so I thought about using one of the interactive visualizations I presented in the previous post. To use that we first need to transform the csv into a spatial object. We can use the following loop to achieve that:

locations <- data.frame(ID=numeric(),LON=numeric(),LAT=numeric(),OZONE=numeric(),AQI=numeric())
for(i in unique(data$Address)){
dat <- data[data$Address==i,]
locations[which(i==unique(data$Address)),] <- data.frame(which(i==unique(data$Address)),unique(dat$Longitude),unique(dat$Latitude),round(mean(dat$Arithmetic.Mean,na.rm=T),2),round(mean(dat$AQI,na.rm=T),0))
}
 
locations$ADDRESS <- unique(data$Address)
 
coordinates(locations)=~LON+LAT
projection(locations)=CRS("+init=epsg:4326")
First of all we create an empty data.frame declaring the type of variable for each column. With this loop we can eliminate all the information we do not need from the dataset and keep only the variables we want to show and analyse. In this case I kept Ozone and the Air Quality Index (AQI), but you can clearly include more if you wish.
In the loop we iterate through the addresses of each EPA station, for each we first subset the main dataset to keep only the data related to that station and then we fill the data.frame with the coordinates of the station and the mean values of Ozone and AQI.
When the loop is over (it may take a while!), we can add the addresses to it and transform it into a SpatialObject. We also need to declare the projection of the coordinates, which is WGS84.
Now we are ready to create an interactive map using the package plotGoogleMaps and the Google Maps API. We can simply use the following line:

map <- plotGoogleMaps(locations,zcol="AQI",filename="EPA_GoogleMaps.html",layerName="EPA Stations")
This creates a map with a marker for each EPA station, coloured with the mean AQI. If we click on a marker we can see the ID of the station, the mean Ozone value and the address (below). The EPA map I created is shown at this link: EPA_GoogleMaps


From this map we can obtain information regarding the EPA stations, which we can use to extract values for individual stations from the dataset.
For example, we can extract values using the ID we created in the loop or the address of the station, which is also available on the Google Map, using the code below:

ID = 135
Ozone <- data[paste(data$Address)==unique(data$Address)[ID]&paste(data$Event.Type)=="None",]
 
ADDRESS = "966 W 32ND"
Ozone <- data[paste(data$Address)==ADDRESS&paste(data$Event.Type)=="None",]
Once we have extracted only data for a single station we can proceed with the time-series analysis.


Time-Series Analysis
There are two ways to tell R that a particular vector or data.frame is in fact a time-series. We have the function ts, available in base R, and the function xts, available in the package xts.
I will first analyse how to use xts, since this is probably the best way of handling time-series.
The first thing we need to do is make sure that our data have a column of class Date. This is done by transforming the current date values into the proper class. The EPA dataset has a Date.Local column that R reads as a factor:

> str(Ozone$Date.Local)
Factor w/ 365 levels "2013-01-01","2013-01-02",..: 90 91 92 93 94 95 96 97 98 99 ...
We can transform this into the class Date using the following line, which creates a new column named DATE in the Ozone object:

Ozone$DATE <- as.Date(Ozone$Date.Local)
Now we can use the function xts to create a time-series object:

Ozone.TS <- xts(x=Ozone$Arithmetic.Mean,order.by=Ozone$DATE)
plot(Ozone.TS,main="Ozone Data",sub="Year 2013")
The first line creates the time-series using the Ozone data and the DATE column we created above. The second line plots the time-series and produces the image below:



To extract the dates of the object Ozone.TS we can use the function index, and we can use the function coredata to extract the ozone values.

index(Ozone.TS)
Date[1:183], format: "2013-03-31" "2013-04-01" "2013-04-02" "2013-04-03" ...
 
coredata(Ozone.TS)
num [1:183, 1] 0.044 0.0462 0.0446 0.0383 0.0469 ...

Subsetting the time-series is super easy in the package xts, as you can see from the code below:

Ozone.TS['2013-05-06'] #Selection of a single day
 
Ozone.TS['2013-03'] #Selection of March data
 
Ozone.TS['2013-05/2013-07'] #Selection by time range
The first line extracts values for a single day (remember that the format is year-month-day); the second extracts values from the month of March. We can use the same method to extract values from one particular year, if we have time-series with multiple years.
The last line extracts values in a particular time range, notice the use of the forward slash to divide the start and end of the range.

We can also extract values by attributes, using the functions index and coredata. For example, if we need to know which days the ozone level was above 0.03 ppm we can simply use the following line:

index(Ozone.TS[coredata(Ozone.TS)>0.03,])
The package xts features some handy functions to apply custom functions to specific time intervals along the time-series. These functions are: apply.weekly, apply.monthly, apply.quarterly and apply.yearly.

The use of these functions is similar to the use of the apply function. Let us look at the example below to clarify:

apply.weekly(Ozone.TS,FUN=mean)
apply.monthly(Ozone.TS,FUN=max)
The first line calculates the mean value of ozone for each week, while the second computes the maximum value for each month. As with the function apply, we are not constrained to functions that are already available in R; we can define our own:

apply.monthly(Ozone.TS,FUN=function(x) {sd(x)/sqrt(length(x))})
In this case, for example, we define a function to calculate the standard error of the mean for each month.

We can use these functions to create a simple plot that shows averages for defined time intervals with the following code:

plot(Ozone.TS,main="Ozone Data",sub="Year 2013")
lines(apply.weekly(Ozone.TS,FUN=mean),col="red")
lines(apply.monthly(Ozone.TS,FUN=mean),col="blue")
lines(apply.quarterly(Ozone.TS,FUN=mean),col="green")
lines(apply.yearly(Ozone.TS,FUN=mean),col="pink")
These lines return the following plot:


From this image it is clear that ozone presents a general decreasing trend over 2013 for this particular station. However, in R there are more precise ways of assessing the trend and seasonality of time-series.

Trends
Let us create another example where we use again the function download.EPA to download NO2 data over 3 years and then assess their trends.

NO2.2013.DATA <- download.EPA(year=2013,property="no2",type="daily")
NO2.2012.DATA <- download.EPA(year=2012,property="no2",type="daily")
NO2.2011.DATA <- download.EPA(year=2011,property="no2",type="daily")
 
ADDRESS = "2 miles south of Ouray and south of the White and Green River confluence" #Copied and pasted from the interactive map
NO2.2013 <- NO2.2013.DATA[paste(NO2.2013.DATA$Address)==ADDRESS&paste(NO2.2013.DATA$Event.Type)=="None",]
NO2.2012 <- NO2.2012.DATA[paste(NO2.2012.DATA$Address)==ADDRESS&paste(NO2.2012.DATA$Event.Type)=="None",]
NO2.2011 <- NO2.2011.DATA[paste(NO2.2011.DATA$Address)==ADDRESS&paste(NO2.2011.DATA$Event.Type)=="None",]
 
 
NO2.TS <- ts(c(NO2.2011$Arithmetic.Mean,NO2.2012$Arithmetic.Mean,NO2.2013$Arithmetic.Mean),frequency=365,start=c(2011,1))
The first lines should be clear from what we said before. The only change is that the time-series is created using the function ts, available in base R. With ts we do not have to create a column of class Date in our dataset; we can just specify the starting point of the time series (using the option start, which in this case is January 2011) and the number of samples per year with the option frequency. In this case the data were collected daily, so the number of samples per year is 365; if we had a time-series with data collected monthly we would specify a frequency of 12.
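For instance, a hypothetical monthly series starting in January 2011 would be declared like this:

x <- rnorm(36)  # three years of made-up monthly values
monthly.TS <- ts(x, frequency = 12, start = c(2011, 1))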

We can decompose the time-series using the function decompose, which is based on moving averages:

dec <- decompose(NO2.TS)
plot(dec)
The related plot is presented below:



There is also another method, based on the loess smoother (for more info: Article) that can be accessed using the function stl:

STL <- stl(NO2.TS,"periodic")
plot(STL)
This function is able to calculate the trend along the whole length of the time-series:



Conclusions
This example shows how to download and access the open pollution data for the US available from the EPA directly from R.
Moreover we have seen here how to map the locations of the stations and subset the dataset. We also looked at ways to perform some introductory time-series analysis on pollution data.
For more information and material regarding time-series analysis please refer to the following references:

A Little Book of R For Time Series

Analysis of integrated and cointegrated time series with R

Introductory time series with R



R code snippets created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.

Categories: Methodology Blogs

DataCamp R Certifications – Now Available on Your LinkedIn Profile

Mon, 2015-05-18 06:11

(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

You can now quickly and easily integrate your DataCamp.com course completion certificates with your LinkedIn profile. Showcase any completed DataCamp course on your profile, including our free Introduction to R course.

Share with managers, recruiters and potential clients the skills and knowledge you’ve acquired at DataCamp. In just a few clicks, you can show off your R and data science skills to your entire professional network. Just follow these easy steps:

  1. Select “My Accomplishments” in the top-right corner, and click add to LinkedIn profile. 
  2. You will go automagically to your LinkedIn profile, where you just have to click “Add to Profile”.

  3. Scroll down to Certifications, and there it is! Your personalized certification. 

The post DataCamp R Certifications – Now Available on Your LinkedIn Profile appeared first on The DataCamp Blog .

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.

Categories: Methodology Blogs

Analyzing R-Bloggers’ posts via Twitter

Mon, 2015-05-18 02:30

(This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers)

For those who don’t know, every time a new blog post gets added to R-Bloggers, it gets a corresponding tweet by @Rbloggers, which gets seen by Rbloggers’ ~20k followers fairly fast. And every time my post gets published, I can’t help but check up on how many people gave that tweet some Twitter love, i.e. “favorite”d or “retweet”ed it. It’s even more exciting than getting a Facebook “like” on a photo from Costa Rica!

Seeing all these tweets and how some tweets get much more attention than others has gotten me thinking. Are there some power users who post almost all the content, or do many blogs contribute equally? Which posts were the most shared? Which blog produces the highest quality posts consistently? Are there more posts during the weekdays than on weekends? And of course the holy grail of bloggers - is there a day when it’s better to post to get more shares?

To answer these questions, I of course turned to R. I used the twitteR package to get information about the latest 3200 tweets made by Rbloggers, Hadley’s httr to scrape each blog post to get the post’s author, and ggplot2 to visualize some cool aspects of the data. Unfortunately Twitter does not allow us to fetch any tweets older than that (if you know of a workaround, please let me know), so the data here will be looking at tweets made from September 2013 until now (mid May 2015). That’s actually a nice start date because it’s exactly when I started grad school and when I first used R. So you can think of this analysis as “R-Bloggers’ tweets since Dean’s R life started” :)

I’m going to use some terminology very loosely and interchangeably throughout this post:

  • “blog” == “author” == “contributor”
  • “tweet” == “post”
  • “successful” post == “highly shared” == “high score” == “high quality”

It’s clear that all those terms are not necessarily the same thing (for example, virality does not necessarily mean high quality), but I’ll be treating them all as the same.

There is also an accompanying interactive document to supplement this post. That document has a few interactive plots/tables for data that is better explored interactively rather than as an image, and it also contains all the source code that was used to make the analysis and all the figures here. The source code is also available on GitHub as the raw text version of the interactive document. In this post I will not be including too much lengthy code, especially not for the plots.

Before going any further, I’d like to say that this is not intended to be a comprehensive analysis and definitely has many weaknesses. It’s just for fun. I’m not even going to be making any statistical significance tests at any point or do any quantitative analysis. Maybe titling this as “Analyzing” is wrong and should instead be “Exploring”? This post looks exclusively at data directly related to @Rbloggers tweets; I am not looking at data from any other social media or how many times the post was shared via R-Bloggers website rather than through Twitter. I’m also not looking at how much discussion (replies) a tweet generates. I wanted to include data from the number of times the “Tweet” button was pressed directly on R-Bloggers, but it looks like most older posts have 0 (maybe the button is a recent addition to R-Bloggers), so it’ll introduce an unfair bias towards new posts. And of course the biggest criticism here is that you simply can’t judge a blog post by the number of times it’s shared on Twitter, but here we go.

Data preparation

This is the boring part - get the data from Twitter, fill in missing pieces of information, clean up… Feel free to skip to the more exciting part.

Get data from Twitter

As mentioned above, I could only grab the last 3200 tweets made by @Rbloggers, which equates to all tweets since Sept 2013. For each tweet I kept several pieces of information: tweet ID, tweet date, day of the week the tweet was made, number of times tweet was favorited, number of times tweet was retweeted, tweet text, and the last URL in the tweet text. The tweet text is essentially a blog post’s title that has been truncated. I keep the last URL that appears in every tweet because that URL always points to the article on R-Bloggers. I’m only storing the date but losing the actual time, which is another weakness. Also, the dates are according to UTC, and many Rbloggers followers are in America, so it might not be the most correct.

Anyway, after authenticating with Twitter, here’s the code to get this information using twitteR:

MAX_TWEETS <- 3200
tweets_raw <- userTimeline('Rbloggers', n = MAX_TWEETS,
                           includeRts = FALSE, excludeReplies = TRUE)
tweets <- ldply(tweets_raw, function(x) {
  data_frame(id = x$id,
             date = as.Date(x$created),
             day = weekdays(date),
             favorites = x$favoriteCount,
             retweets = x$retweetCount,
             title = x$text,
             url = x$urls %>% .[['url']] %>% tail(1)
  )
})
rm(tweets_raw)  # being extremely memory conscious

Remember the full source code can be viewed here.

Scrape R-Bloggers to get author info

Since I had some questions about the post authors and a tweet doesn’t give that information, I resorted to scraping the R-Bloggers post linked in each tweet using httr to find the author. This part takes a bit of time to run. There were a few complications, mostly with authors whose name is their email address and R-Bloggers attempts to hide it, but here is how I accomplished this step:

# Get the author of a single post given an R-Bloggers post URL
get_post_author_single <- function(url) {
  if (is.null(url)) {
    return(NA)
  }

  # get author HTML node
  author_node <- GET(url) %>%
    httr::content("parsed") %>%
    getNodeSet("//a[@rel='author']")

  if (author_node %>% length != 1) {
    return(NA)
  }

  # get author name
  author <- author_node %>% .[[1]] %>% xmlValue

  # r-bloggers hides email address names so grab the email a different way
  if (nchar(author) > 100 && grepl("document\.getElementsByTagName", author)) {
    author <- author_node %>% .[[1]] %>% xmlGetAttr("title")
  }

  author
}

# Get a list of URL --> author for a list of R-Bloggers post URLs
get_post_author <- function(urls) {
  lapply(urls, get_post_author_single) %>% unlist
}

# Add the author to each tweet.
# This will take several minutes because we're scraping r-bloggers 3200 times
# (don't run this frequently - we don't want to overwork our beloved R-Bloggers server)
tweets %<>% mutate(author = get_post_author(url))

# Remove NA author (these are mostly jobs postings, plus a few blog posts that have been deleted)
tweets %<>% na.omit

The last line there removed any tweets without an author. That essentially removes all tweets that are advertising job postings and a few blog posts that have been deleted.

Clean up data

It’s time for some clean up:

  • Remove the URL and #rstats hashtag from every tweet’s title
  • Older posts all contain the text “This article was originally posted on … and kindly contributed by …” - try to remove that as well
  • Order the day factor levels in order from Monday - Sunday
  • Truncate very long author names with an ellipsis
  • Merge duplicate tweets (tweets with the same author and title that are posted within a week)

After removing duplicates and previously removing tweets about job postings, we are left with 2979 tweets (down from 3200). You can see what the data looks like here or see the code for the cleanup on that page as well.
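As a rough sketch of the first and third bullets above (the regex and the explicit level order are illustrative; the actual cleanup code is linked above):

library(dplyr)

tweets <- tweets %>%
  mutate(
    # drop URLs and the #rstats hashtag from each title
    title = trimws(gsub("http[[:graph:]]+|#rstats", "", title)),
    # order the day factor Monday through Sunday
    day = factor(day, levels = c("Monday", "Tuesday", "Wednesday",
                                 "Thursday", "Friday", "Saturday", "Sunday"))
  )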

Add a score metric

Now that we have almost all the info we need for the tweets, there is one thing missing. It’d be useful to have a metric for how successful a tweet is using the very little bit of information we have. This is of course very arbitrary. I chose to score a tweet’s success as a linear combination of its “# of favorites” and “# of retweets”. Since there are roughly twice as many favorites as retweets in total, retweets get twice the weight. Very simple formula :)

sum(tweets$favorites) / sum(tweets$retweets)  # result = 2.1
tweets$score <- tweets$favorites + tweets$retweets * 2

Exploration

Time for the fun stuff! I’m only going to make a few plots, you can get the data from GitHub if you want to play around with it in more depth.

Scores of all tweets

First I’d like to see a simple scatterplot showing the number of favorites and retweets for each blog post.

Looks like most posts are close to the (0, 0) area, with 20 favorites and 10 retweets being the maximum boundary for most. A very small fraction of tweets make it past the 40 favorites or 20 retweets.
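The figure itself isn’t reproduced here, but a similar scatterplot can be sketched with ggplot2 (the transparency setting is my choice, not necessarily the original’s):

library(ggplot2)

ggplot(tweets, aes(x = favorites, y = retweets)) +
  geom_point(alpha = 0.4) +   # many points overlap near (0, 0)
  theme_bw()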

Most successful posts

From the previous plot it seems like there are about 10 posts that are much higher up than everyone else, so let’s see what the top 10 most shared Rbloggers posts on Twitter were since Sep 2013.

title | date | author | favorites | retweets | score
A new interactive interface for learning R online, for free | 2015.04.14 | DataCamp | 78 | 49 | 176
Introducing Radiant: A shiny interface for R | 2015.05.04 | R(adiant) news | 85 | 44 | 173
Choosing R or Python for data analysis? An infographic | 2015.05.12 | DataCamp | 59 | 54 | 167
Why the Ban on P-Values? And What Now? | 2015.03.07 | Nicole Radziwill | 47 | 47 | 141
Machine Learning in R for beginners | 2015.03.26 | DataCamp | 68 | 29 | 126
Free Stanford online course on Statistical Learning (with R) starting on 19 Jan 2015 | 2014.11.22 | Yanchang Zhao | 54 | 35 | 124
Four Beautiful Python, R, MATLAB, and Mathematica plots with LaTeX | 2014.12.20 | Plotly | 57 | 29 | 115
In-depth introduction to machine learning in 15 hours of expert videos | 2014.09.24 | Kevin Markham | 64 | 20 | 104
Learn Statistics and R online from Harvard | 2015.01.17 | David Smith | 49 | 27 | 103
R Tutorial on Reading and Importing Excel Files into R | 2015.04.04 | DataCamp | 61 | 20 | 101

8/10 of the top 10 posts have “R” in their title… correlation or causation or random? Maybe I should start doing that too then!

Looks like the DataCamp blog is a pretty major Rbloggers contributor with 4/10 of the most tweeted posts. Which leads me perfectly into the next section.

Highest scoring authors

So as I just said, DataCamp looks like it contributes very high quality posts. I wanted to see which blogs contribute the most successful posts consistently. The following shows the authors with the highest average score per tweet.

author | num_tweets | avg_favorites | avg_retweets | avg_score
R(adiant) news | 1 | 85.0 | 44.0 | 173.0
Martin Schneider | 1 | 34.0 | 25.0 | 84.0
Juuso Parkkinen | 1 | 34.0 | 22.0 | 78.0
Tal Yarkoni | 1 | 15.0 | 25.0 | 65.0
Kevin Markham | 6 | 35.8 | 13.0 | 61.8
Dean Attali’s R Blog | 5 | 32.2 | 13.2 | 58.6
filip Schouwenaars | 1 | 26.0 | 16.0 | 58.0
Christian Groll | 1 | 13.0 | 21.0 | 55.0
Josh Paulson | 1 | 23.0 | 15.0 | 53.0
Kushan Shah | 1 | 33.0 | 10.0 | 53.0

First impression: Woo, I’m in there! :D

Posts by highest scoring authors

Now that I know which blogs have the best posts on average, I wanted to see what each of their tweets looked like.

It’ll be nice to see how these compare to all other posts. The following figure shows the scores of all tweets, and highlights the posts made by any of the top-10 authors.

Pretty. But it looks like the list of top 10 authors is dominated by one-hit wonders, which makes sense because it’s much easier to put a lot of effort into one single post than to constantly pump out great articles over and over again. So let’s try again seeing who has the highest average score, but only consider blogs that contributed more than one post.

author | num_tweets | avg_favorites | avg_retweets | avg_score
Kevin Markham | 6 | 35.8 | 13.0 | 61.8
Dean Attali’s R Blog | 5 | 32.2 | 13.2 | 58.6
Matt | 2 | 27.5 | 12.0 | 51.5
Bruno Rodrigues | 4 | 23.8 | 13.2 | 50.2
Plotly | 7 | 25.6 | 11.6 | 48.7
DataCamp | 32 | 23.3 | 12.4 | 48.0
Tim Phan | 2 | 27.5 | 9.0 | 45.5
Slawa Rokicki | 3 | 21.0 | 11.3 | 43.7
Jan Górecki - R | 3 | 25.3 | 8.7 | 42.7
Nicole Radziwill | 13 | 21.3 | 10.3 | 41.9

Ah, there’s DataCamp - by far more posts than the rest of us, and still a very high average score. Respect.

Who contributes the most?

I also wanted to know how many blogs contribute and how much each one contributes. R-Bloggers says on its frontpage that there are 573 blogs. According to my data, there are 420 unique authors since Sept 2013, so about 1/4 of the blogs have not posted since then. Here is the distribution of how many blog posts different blogs made:
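The histogram isn’t shown here, but such a distribution could be drawn along these lines (bin width and the axis cut-off are my choices):

library(dplyr)
library(ggplot2)

posts_per_blog <- tweets %>% count(author)

ggplot(posts_per_blog, aes(x = n)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 50) +               # the text notes the plot is cut off at 50
  xlab("Number of posts per blog")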

Seems like a lot of people only posted once in the past 1.5 years. That graph is actually cut off at 50 because there are a few outliers (individuals who posted way more than 50). Let’s see who these power users are, so we know who to thank for most of the content.

author | num_tweets | avg_favorites | avg_retweets | avg_score
David Smith | 166 | 7.9 | 5.6 | 19.2
Thinking inside the box | 118 | 1.8 | 0.9 | 3.6
Joseph Rickert | 117 | 8.3 | 3.8 | 15.9
xi’an | 88 | 3.1 | 1.2 | 5.6
Tal Galili | 71 | 5.7 | 4.2 | 14.1

There you have it - the 5 people who single-handedly (or.. quintuple-handedly?) are responsible for 1/6 of the posts we’ve seen since I learned what R is.

Post success by day of week

One of my main questions was whether there is some correlation between when a post is posted and how successful it is. I also wanted to see if there are certain days of the week that are more/less active. Here is a table summarizing the number of posts made on each day of the week and how successful each day was on average.

day | num_tweets | favorites_per_post | retweets_per_post | avg_score
Monday | 451 | 7.6 | 3.7 | 15.0
Tuesday | 551 | 7.2 | 3.3 | 13.8
Wednesday | 461 | 7.1 | 3.4 | 13.9
Thursday | 487 | 7.6 | 3.6 | 14.8
Friday | 429 | 7.5 | 3.7 | 14.9
Saturday | 323 | 8.7 | 4.3 | 17.3
Sunday | 277 | 8.9 | 3.8 | 16.5
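A summary like the one above could be computed along these lines (column names as defined earlier in this post):

library(dplyr)

tweets %>%
  group_by(day) %>%
  summarise(num_tweets = n(),
            favorites_per_post = round(mean(favorites), 1),
            retweets_per_post = round(mean(retweets), 1),
            avg_score = round(mean(score), 1))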

Cool! This actually produced some non-boring results. I’m not going to make any significance tests, but I do see two interesting pieces of information here. First of all, it looks like the weekend (Sat-Sun) is quieter than weekdays in terms of number of posts made. Second of all, the two days with the highest average score are also Sat-Sun. I won’t go into whether or not having ~1 more favorite and < 1 more retweet on average is significant, but it’s at least something. Maybe because there are fewer posts on the weekend, each post gets a bit more visibility and stays at the top of the feed longer, thereby having a small advantage? Or maybe the small difference in score we’re seeing is just because there are fewer posts in total and it’ll even out once n is large enough?

Whatever the case might be, here’s a plot that shows the score of every tweet grouped by day. The large points show the average of all posts made on that day.

Significant or not, at least it looks pretty.

Wordcloud

I must admit I’m not the biggest fan of wordclouds, but it feels like no amateur R analysis can be complete without one of these bad boys these days. Here you go wordcloud-lovers - the 100 most popular terms in R-Bloggers posts’ titles since Sept 2013.

I actually don’t have much to comment on that, there isn’t anything that strikes me as surprising here.

Remember to check out the accompanying interactive doc + source code!

To leave a comment for the author, please follow the link and comment on his blog: Dean Attali's R Blog.

Categories: Methodology Blogs

random 0.2.4

Sun, 2015-05-17 23:01

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of our random package for truly (hardware-based) random numbers as provided by random.org is now on CRAN.

The R 3.2.0 release brought the change to use an internal method="libcurl", which we use if available; otherwise the curl::curl() method added in release 0.2.3 is used. We are also a little more explicit about closing connections, and added really basic regression tests -- as it is hard to test draws from hardware-based RNGs.

Courtesy of CRANberries comes a diffstat report for this release. Current and previous releases are available here as well as on CRAN.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

Categories: Methodology Blogs

infuser: a template engine for R

Sun, 2015-05-17 16:00

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

Version 0.1 of infuser was just released on CRAN.

infuser is a very basic templating engine. It allows you to replace parameters in character strings or text files with a given specified value.

I would often include some SQL code in my R scripts so that I could make the parameters in the SQL code variable. But I was always a bit hesitant about the cleanliness of this. It irked me so much that I finally wrote this very simple package. The infuse function allows me to read in a text file (or a character string for that matter) and infuse it with the requested values.

This means that I can now put my SQL code in a separate .sql file :), load it into a character string with all the desired values in place, and use it to my liking. Of course, the same thing can be done for any text file (e.g. an .html file).
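A minimal sketch of that workflow, assuming the default {{...}} placeholder syntax from the package documentation (the table and parameter names are made up):

library(infuser)

# an inline template; the same call works with the path to a .sql file
sql_template <- "SELECT * FROM sales WHERE year = {{year}} AND region = '{{region}}'"

infuse(sql_template, year = 2015, region = "EMEA")
# "SELECT * FROM sales WHERE year = 2015 AND region = 'EMEA'"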

Information on usage and examples can be found here: http://github.com/Bart6114/infuser

To leave a comment for the author, please follow the link and comment on his blog: FishyOperations.

Categories: Methodology Blogs

The paper helicopter experiment

Sun, 2015-05-17 03:01

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

The paper helicopter is one of the devices to explain about design of experiments. The aim is to create the longest flying paper helicopter by means of experimental design.
Paper helicopters are a nice example because they are cheap to make, it is easy to measure landing time, and there are enough variables to make the outcome non-obvious.
Rather than make and measure my own helicopters, I decided to use data from the internet. In this post I use data from williamghunter.net and http://www.rose-hulman.edu. There is more data on the internet, but these two are fairly similar. Both use a fractional factorial design of 16 runs and they have the same variables. However, a quick check showed that these were different results and, very importantly, the aliasing structure was different.
Data

Data were taken from the locations given above. Rather than using the coded units, the data were converted to sizes in cm. Time to land was converted to seconds.
Since these were separate experiments, it has to be assumed that they used different paper and different heights to drop the helicopters from. It even seems that different ways were found to attach a paperclip to the helicopters.

Simple analysis

To confirm the data, an analysis on coded units was performed. These results were the same as given by the websites (results not shown here). My own analysis starts with real-world units and is by regression. A disadvantage of real-world units is that one cannot compare the sizes of the effects; however, given the designs used, the t-statistic can be used for this purpose.
The first data set shows WingLength and BodyLength to have the largest effects.
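The coefficient tables below are standard summary(lm()) output; a sketch of the kind of call involved (the data frame name and exact factor coding are assumptions based on the coefficient names):

# helicopter1: the first data set, with Time in seconds, sizes in cm, and
# factor columns (PaperType, PaperClip, Fold, TapedBody, TapedWing) that
# produce coefficient names such as PaperTyperegular1 and PaperClipYes
fit1 <- lm(Time ~ PaperType + WingLength + BodyLength + BodyWidth +
             PaperClip + Fold + TapedBody + TapedWing,
           data = helicopter1)
summary(fit1)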
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.92798    0.54903   3.512 0.009839 ** 
PaperTyperegular1 -0.12500    0.13726  -0.911 0.392730    
WingLength         0.17435    0.03088   5.646 0.000777 ***
BodyLength        -0.08999    0.03088  -2.914 0.022524 *  
BodyWidth          0.01312    0.07205   0.182 0.860634    
PaperClipYes       0.05000    0.13726   0.364 0.726403    
FoldYes           -0.10000    0.13726  -0.729 0.489918    
TapedBodyYes      -0.15000    0.13726  -1.093 0.310638    
TapedWingYes       0.17500    0.13726   1.275 0.242999  
The second data set shows WingLength, PaperClip and PaperType to have the largest effects.
Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.73200    0.21737   3.368  0.01196 *  
PaperTyperegular2  0.28200    0.06211   4.541  0.00267 ** 
WingLength         0.16654    0.01223  13.622  2.7e-06 ***
BodyLength        -0.02126    0.01630  -1.304  0.23340    
BodyWidth         -0.03307    0.04890  -0.676  0.52058    
PaperClipYes      -0.35700    0.06211  -5.748  0.00070 ***
FoldYes            0.04500    0.06211   0.725  0.49222    
TapedBodyYes      -0.14700    0.06211  -2.367  0.04983 *  
TapedWingYes       0.06600    0.06211   1.063  0.32320
It seems then, that the two experiments show somewhat different effects. WingLength is certainly important. BodyLength maybe. Regarding paper, both have regular paper, but one has bond paper and the other construction paper. It is not difficult to imagine these are quite different.
Combined analysis

The combination analysis is programmed in JAGS. To capture a different falling distance, a Mul parameter is used, which defines a multiplicative effect between the two experiments. In addition, both sets have their own measurement error. There are four types of paper, two from each data set, and three levels of paperclip (no paperclip is assumed to be the same for both experiments). In addition to the parameters given earlier, residuals are estimated, in order to have some idea about the quality of fit. The model then looks like this:

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          FO*Fold[i]+
          TB*TapedBody[i]+
          TW*TapedWing[i]
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
  }
  for (i in 1:4) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01)
  BL ~dnorm(0,1000)
  BW ~dnorm(0,1000)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01)
  
  FO ~dnorm(0,1000)
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)

  Mul ~ dnorm(1,1) %_% I(0,2)
}
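The output that follows is the standard print of an R2jags fit; here is a sketch of how such a model might be run (the data list and the monitored parameters are assumptions based on the model code, not copied from the original script):

library(R2jags)

# datain: a list holding n, test, Time and the design columns used in jmodel
params <- c("WL", "BL", "BW", "PT", "PC", "FO", "TB", "TW",
            "Mul", "StDev", "residual")

jagsfit <- jags(data = datain,
                inits = NULL,
                parameters.to.save = params,
                model.file = jmodel,
                n.chains = 4,
                n.iter = 3000,
                n.burnin = 1500)
jagsfit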

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/Rtmp4o0rhh/model16f468e854ce.txt", fit using jags,
 4 chains, each with 3000 iterations (first 1500 discarded)
 n.sims = 6000 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%  97.5%  Rhat n.eff
BL            -0.029   0.014  -0.056  -0.038  -0.028  -0.019 -0.001 1.001  4400
BW            -0.005   0.025  -0.052  -0.023  -0.006   0.011  0.044 1.002  1900
FO             0.005   0.028  -0.050  -0.014   0.005   0.023  0.058 1.001  6000
Mul            1.166   0.149   0.819   1.087   1.176   1.254  1.433 1.028   130
PC[1]          0.000   0.000   0.000   0.000   0.000   0.000  0.000 1.000     1
PC[2]          0.066   0.141  -0.208  -0.021   0.061   0.147  0.360 1.002  2300
PC[3]         -0.362   0.070  -0.501  -0.404  -0.362  -0.319 -0.225 1.001  6000
PT[1]          1.111   0.397   0.516   0.864   1.059   1.286  2.074 1.021   150
PT[2]          1.019   0.379   0.437   0.783   0.974   1.186  1.925 1.019   160
PT[3]          0.728   0.170   0.397   0.615   0.728   0.840  1.068 1.002  2900
PT[4]          0.991   0.168   0.655   0.885   0.993   1.103  1.309 1.002  1600
StDev[1]       0.133   0.039   0.082   0.108   0.127   0.150  0.225 1.005   540
StDev[2]       0.304   0.075   0.192   0.251   0.292   0.343  0.488 1.003  1300
TB            -0.144   0.059  -0.264  -0.181  -0.144  -0.108 -0.025 1.001  4100
TW             0.084   0.059  -0.033   0.045   0.084   0.122  0.203 1.001  4400
WL             0.164   0.013   0.138   0.156   0.164   0.172  0.188 1.004   810
residual[1]    0.174   0.146  -0.111   0.079   0.173   0.268  0.464 1.002  1700
residual[2]    0.466   0.158   0.162   0.361   0.463   0.567  0.780 1.004   730
residual[3]    0.150   0.170  -0.173   0.041   0.147   0.253  0.499 1.003  1100
residual[4]   -0.416   0.162  -0.733  -0.523  -0.418  -0.308 -0.099 1.001  3800
residual[5]   -0.087   0.168  -0.419  -0.198  -0.084   0.026  0.238 1.005   560
residual[6]   -0.085   0.156  -0.397  -0.184  -0.084   0.016  0.221 1.003  1200
residual[7]   -0.056   0.159  -0.371  -0.156  -0.055   0.047  0.251 1.003   910
residual[8]   -0.203   0.157  -0.527  -0.304  -0.198  -0.100  0.095 1.001  6000
residual[9]    0.150   0.150  -0.139   0.052   0.148   0.247  0.451 1.001  6000
residual[10]   0.103   0.156  -0.200   0.003   0.101   0.206  0.415 1.004   720
residual[11]   0.133   0.160  -0.176   0.027   0.131   0.237  0.454 1.002  2100
residual[12]   0.335   0.177  -0.006   0.218   0.332   0.451  0.689 1.004   830
residual[13]  -0.436   0.156  -0.747  -0.536  -0.436  -0.337 -0.128 1.002  2100
residual[14]   0.098   0.162  -0.227  -0.007   0.099   0.205  0.410 1.004   670
residual[15]  -0.018   0.160  -0.340  -0.118  -0.015   0.084  0.292 1.003   920
residual[16]  -0.127   0.155  -0.441  -0.224  -0.125  -0.027  0.173 1.001  3600
residual[17]   0.037   0.088  -0.135  -0.018   0.037   0.093  0.215 1.002  1600
residual[18]  -0.088   0.090  -0.274  -0.141  -0.086  -0.031  0.081 1.002  2500
residual[19]  -0.074   0.088  -0.248  -0.129  -0.072  -0.018  0.100 1.002  1900
residual[20]  -0.079   0.088  -0.259  -0.133  -0.076  -0.023  0.091 1.001  3800
residual[21]  -0.037   0.087  -0.201  -0.093  -0.039   0.016  0.141 1.002  3000
residual[22]   0.051   0.087  -0.128  -0.001   0.053   0.107  0.221 1.001  4800
residual[23]  -0.008   0.084  -0.177  -0.061  -0.009   0.046  0.159 1.001  5500
residual[24]   0.129   0.086  -0.047   0.076   0.130   0.185  0.294 1.002  1900
residual[25]   0.196   0.087   0.030   0.141   0.196   0.249  0.370 1.003  1400
residual[26]  -0.027   0.084  -0.195  -0.081  -0.026   0.029  0.138 1.001  6000
residual[27]   0.070   0.088  -0.101   0.016   0.070   0.124  0.247 1.001  3700
residual[28]  -0.166   0.089  -0.355  -0.221  -0.163  -0.108  0.004 1.001  3700
residual[29]  -0.052   0.087  -0.223  -0.107  -0.053   0.002  0.124 1.001  4300
residual[30]   0.039   0.089  -0.139  -0.016   0.038   0.095  0.218 1.002  2500
residual[31]  -0.079   0.089  -0.245  -0.135  -0.080  -0.026  0.103 1.002  2300
residual[32]   0.048   0.085  -0.122  -0.006   0.049   0.102  0.214 1.002  2300
deviance     -15.555   7.026 -26.350 -20.655 -16.487 -11.540  0.877 1.004   750

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 24.6 and DIC = 9.0
DIC is an estimate of expected predictive error (lower deviance is better).
Striking in the results are the big residuals, for instance for observations 2, 4 and 13. The residuals for observations 4 and 13 are also big when a similar classical model is used, hence this is a clear indication of some kind of interaction.

Model with interactions

Adding the most obvious interactions, such as WingLength*TapedBody, did not really provide a suitable answer. Indeed, large residuals at observations 4 and 13, which are at opposite sides in the fractional factorial design, cannot be resolved with one interaction. Hence I proceeded with adding all two-way interactions. Since this was expected to result in a model without clear estimates, all interactions had a strong prior; the mean was 0 and the precision (tau) was 1000. This model was subsequently reduced by giving the interactions which clearly differed from 0 a lesser precision, while interactions which were clearly zero were removed. During this process the parameter Fold was removed from the parameter set. Finally, quadratic effects were added. There is one additional parameter, other; it has no function in the model, but shows what the prior for the interactions looks like. Parameters with a posterior standard deviation smaller than that of other have had information added from the data.

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          TB*TapedBody[i]+
          TW*TapedWing[i]+
          
          WLBW*WingLength[i]*BodyWidth[i]+
          WLPC[1]*WingLength[i]*(PaperClip[i]==2)+
          WLPC[2]*WingLength[i]*(PaperClip[i]==3)+
          
          BLPT[1]*BodyLength[i]*(PaperType[i]==2)+
          BLPT[2]*BodyLength[i]*(PaperType[i]==3)+
          BLPC[1]*BodyLength[i]*(PaperClip[i]==2)+
          BLPC[2]*BodyLength[i]*(PaperClip[i]==3)+
          
          BWPC[1]*BodyWidth[i]*(PaperClip[i]==2)+
          BWPC[2]*BodyWidth[i]*(PaperClip[i]==3) +
          
          WLWL*WingLength[i]*WingLength[i]+
          BLBL*BodyLength[i]*BodyLength[i]+
          BWBW*BodyWidth[i]*BodyWidth[i]
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
    WLPC[i] ~dnorm(0,1)
    BLPT[i] ~dnorm(0,1)
    BLPC[i] ~dnorm(0,1) 
    BWPC[i] ~dnorm(0,1)      
  }
  for (i in 1:3) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01) 
  BL ~dnorm(0,0.01)
  BW ~dnorm(0,0.01)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01) 
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)
  
  WLBW~dnorm(0,1)
  WLTW~dnorm(0,1)
  
  WLWL~dnorm(0,1)
  BLBL~dnorm(0,1) 
  BWBW~dnorm(0,1)
  
  other~dnorm(0,1)
  Mul ~ dnorm(1,1) %_% I(0,2)
}

Inference for Bugs model at "C:/Users/Kees/AppData/Local/Temp/Rtmp4o0rhh/model16f472b05364.txt", fit using jags,
 5 chains, each with 4000 iterations (first 2000 discarded), n.thin = 2
 n.sims = 5000 iterations saved
             mu.vect sd.vect    2.5%     25%     50%     75%  97.5%  Rhat n.eff
BL             0.021   0.197  -0.367  -0.080   0.027   0.121  0.396 1.021   590
BLBL          -0.001   0.015  -0.027  -0.009  -0.003   0.006  0.031 1.015  1200
BLPC[1]       -0.099   0.105  -0.295  -0.125  -0.086  -0.053  0.021 1.100   560
BLPC[2]       -0.110   0.111  -0.334  -0.134  -0.094  -0.060  0.018 1.130   250
BLPT[1]       -0.038   0.190  -0.503  -0.124   0.001   0.069  0.286 1.005   600
BLPT[2]        0.058   0.038  -0.031   0.045   0.063   0.078  0.113 1.063   400
BW            -0.430   0.558  -1.587  -0.657  -0.389  -0.143  0.463 1.045   960
BWBW           0.009   0.094  -0.160  -0.031   0.009   0.052  0.176 1.053  1300
BWPC[1]       -0.224   0.173  -0.615  -0.295  -0.209  -0.133  0.064 1.011  5000
BWPC[2]       -0.093   0.101  -0.285  -0.137  -0.091  -0.044  0.085 1.040  5000
Mul            1.053   0.145   0.680   0.997   1.069   1.139  1.281 1.098   290
PC[1]          0.000   0.000   0.000   0.000   0.000   0.000  0.000 1.000     1
PC[2]          1.459   2.367  -3.571   0.333   1.565   2.617  6.138 1.019   420
PC[3]          0.401   0.732  -0.619   0.032   0.309   0.629  1.954 1.074   320
PT[1]          1.353   1.437  -1.364   0.556   1.318   2.088  4.128 1.032   480
PT[2]          1.906   1.767  -1.087   0.828   1.726   2.814  5.879 1.013  1300
PT[3]          0.731   1.419  -1.864  -0.058   0.682   1.444  3.535 1.032   520
StDev[1]       0.108   0.082   0.045   0.067   0.088   0.120  0.302 1.023   450
StDev[2]       0.267   0.156   0.122   0.177   0.229   0.301  0.706 1.021   390
TB            -0.146   0.051  -0.247  -0.172  -0.145  -0.119 -0.048 1.011  5000
TW             0.086   0.054  -0.007   0.055   0.082   0.112  0.204 1.010  1700
WL             0.209   0.380  -0.496   0.007   0.188   0.394  1.035 1.014   670
WLBW           0.051   0.062  -0.013   0.026   0.043   0.062  0.167 1.159   220
WLPC[1]        0.057   0.210  -0.304  -0.063   0.024   0.152  0.556 1.004  1600
WLPC[2]        0.020   0.027  -0.031   0.010   0.021   0.033  0.066 1.044  2400
WLWL          -0.014   0.026  -0.072  -0.026  -0.011   0.001  0.032 1.014  5000
other          0.002   1.007  -1.973  -0.680   0.000   0.684  1.952 1.002  2200
residual[1]    0.227   0.272  -0.178   0.066   0.190   0.334  0.935 1.041   390
residual[2]    0.035   0.231  -0.447  -0.084   0.037   0.160  0.503 1.007  2500
residual[3]    0.026   0.269  -0.404  -0.118  -0.002   0.131  0.587 1.039   430
residual[4]   -0.123   0.279  -0.542  -0.276  -0.157  -0.018  0.530 1.053   370
residual[5]   -0.046   0.241  -0.535  -0.168  -0.043   0.083  0.422 1.008  5000
residual[6]   -0.094   0.241  -0.568  -0.221  -0.095   0.035  0.390 1.005  2600
residual[7]    0.284   0.268  -0.139   0.140   0.263   0.392  0.861 1.046   430
residual[8]    0.018   0.240  -0.460  -0.107   0.022   0.144  0.494 1.006  5000
residual[9]    0.121   0.299  -0.310  -0.042   0.079   0.223  0.827 1.054   300
residual[10]   0.038   0.237  -0.428  -0.086   0.034   0.155  0.518 1.006  3100
residual[11]  -0.077   0.251  -0.562  -0.204  -0.073   0.046  0.401 1.020  5000
residual[12]   0.153   0.262  -0.286   0.013   0.133   0.267  0.711 1.035   610
residual[13]  -0.024   0.244  -0.466  -0.160  -0.035   0.095  0.493 1.008  5000
residual[14]  -0.019   0.244  -0.537  -0.140  -0.013   0.111  0.456 1.006  5000
residual[15]  -0.159   0.250  -0.663  -0.281  -0.156  -0.038  0.302 1.026   860
residual[16]   0.034   0.273  -0.531  -0.076   0.056   0.178  0.486 1.037   410
residual[17]   0.001   0.115  -0.185  -0.057  -0.008   0.047  0.232 1.047   890
residual[18]   0.016   0.105  -0.187  -0.038   0.017   0.067  0.211 1.014  3300
residual[19]  -0.068   0.108  -0.262  -0.118  -0.068  -0.017  0.127 1.036  5000
residual[20]   0.067   0.114  -0.138   0.017   0.067   0.115  0.270 1.046  4500
residual[21]   0.003   0.117  -0.223  -0.046   0.007   0.057  0.203 1.044  3200
residual[22]  -0.004   0.113  -0.202  -0.059  -0.007   0.044  0.211 1.035  2000
residual[23]  -0.039   0.134  -0.313  -0.081  -0.023   0.027  0.145 1.097   300
residual[24]   0.009   0.114  -0.197  -0.042   0.009   0.061  0.223 1.039  5000
residual[25]   0.045   0.110  -0.170  -0.005   0.046   0.095  0.248 1.028  5000
residual[26]  -0.044   0.108  -0.252  -0.096  -0.043   0.007  0.165 1.024  4000
residual[27]   0.046   0.112  -0.164  -0.005   0.046   0.100  0.264 1.022  3600
residual[28]  -0.062   0.115  -0.296  -0.104  -0.053  -0.004  0.112 1.047  1400
residual[29]  -0.025   0.143  -0.321  -0.064  -0.006   0.042  0.153 1.110   230
residual[30]  -0.016   0.118  -0.228  -0.066  -0.015   0.037  0.196 1.042  1400
residual[31]  -0.025   0.115  -0.239  -0.072  -0.021   0.028  0.174 1.047  1300
residual[32]   0.020   0.111  -0.176  -0.033   0.017   0.066  0.233 1.041  2600
deviance     -32.864  19.923 -62.354 -46.843 -35.763 -22.807 16.481 1.014   420
For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 196.7 and DIC = 163.8
DIC is an estimate of expected predictive error (lower deviance is better).

Model discussion

This model does not have the big residuals. In addition, it seems that some parameters, e.g. WLWL and WLBW, have small mean values and small standard deviations. To me this suggests that they are indeed estimated and found to be close to 0. After all, if the data contained no information, their standard deviation would be similar to that of the prior, which is much larger, as seen from the other parameter.
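One quick way to check this from the fitted object (a small addition of mine, assuming the R2jags fit is stored in jj as in the code further below):

post <- jj$BUGSoutput$summary                 # posterior summaries from R2jags
round(post[c("WLWL", "WLBW", "other"), c("mean", "sd")], 3)
# 'other' keeps roughly the prior sd of 1; parameters with a much smaller
# posterior sd have been informed by the data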
The quadratic effects were added to allow detection of a maximum. There is little evidence of these effects, except perhaps in WingLength (parameter WLWL).
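Purely as an illustration (this calculation is not in the original post; it ignores the interaction terms and the considerable posterior uncertainty), the vertex of a quadratic a*x + b*x^2 lies at x = -a/(2*b), so the posterior means above would put the implied optimum wing length at:

-0.209 / (2 * -0.014)   # WL and WLWL posterior means; roughly 7.5, inside the tested range
# the 95% interval for WLWL spans zero, so this should not be over-interpreted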
For descriptive purposes, I will leave these parameters in. However, for predictive purposes, it may be better to remove them or shrink them closer to zero.
Given the complex way in which the parameters were chosen, it is quite possible that a different model would be better. In hindsight, I might have used the BMA function to do a more thorough selection. Thus the model needs to be validated some more. Since I found two additional data sets online, these might be used for this purpose.
Code

h1 <- read.table(header=TRUE,text='
PaperType WingLength BodyLength BodyWidth PaperClip Fold TapedBody TapedWing Time
regular1 7.62 7.62 3.175 No No No No 2.5
bond 7.62 7.62 3.175 Yes No Yes Yes 2.9
regular1 12.065 7.62 3.175 Yes Yes No Yes 3.5
bond 12.065 7.62 3.175 No Yes Yes No 2.7
regular1 7.62 12.065 3.175 Yes Yes Yes No 2
bond 7.62 12.065 3.175 No Yes No Yes 2.3
regular1 12.065 12.065 3.175 No No Yes Yes 2.9
bond 12.065 12.065 3.175 Yes No No No 3
regular1 7.62 7.62 5.08 No Yes Yes Yes 2.4
bond 7.62 7.62 5.08 Yes Yes No No 2.6
regular1 12.065 7.62 5.08 Yes No Yes No 3.2
bond 12.065 7.62 5.08 No No No Yes 3.7
regular1 7.62 12.065 5.08 Yes No No Yes 1.9
bond 7.62 12.065 5.08 No No Yes No 2.2
regular1 12.065 12.065 5.08 No Yes No No 3
bond 12.065 12.065 5.08 Yes Yes Yes Yes 3')

h2 <- read.table(header=TRUE,text='
PaperType BodyWidth BodyLength WingLength PaperClip Fold TapedBody TapedWing Time
regular2 2.54 3.81 5.08 No No No No 1.74
construction 2.54 3.81 5.08 No Yes Yes Yes 1.296
regular2 3.81 3.81 5.08 Yes No Yes Yes 1.2
construction 3.81 3.81 5.08 Yes Yes No No 0.996
regular2 2.54 7.62 5.08 Yes Yes Yes No 1.056
construction 2.54 7.62 5.08 Yes No No Yes 1.104
regular2 3.81 7.62 5.08 No Yes No Yes 1.668
construction 3.81 7.62 5.08 No No Yes No 1.308
regular2 2.54 3.81 10.16 Yes Yes No Yes 2.46
construction 2.54 3.81 10.16 Yes No Yes No 1.74
regular2 3.81 3.81 10.16 No Yes Yes No 2.46
construction 3.81 3.81 10.16 No No No Yes 2.184
regular2 2.54 7.62 10.16 No No Yes Yes 2.316
construction 2.54 7.62 10.16 No Yes No No 2.208
regular2 3.81 7.62 10.16 Yes No No No 1.98
construction 3.81 7.62 10.16 Yes Yes Yes Yes 1.788')
l1 <- lm(Time ~ PaperType + WingLength + BodyLength + BodyWidth +
        PaperClip + Fold + TapedBody + TapedWing, data=h1)
summary(l1)
residuals(l1)

l2 <- lm(Time ~ PaperType + WingLength + BodyLength + BodyWidth +
        PaperClip + Fold + TapedBody + TapedWing, data=h2)
summary(l2)

h1$test <- 'WH' # WingLength, BodyLength
h2$test <- 'RH' # WingLength, PaperClip, PaperType

helis <- rbind(h1,h2)
helis$test <- factor(helis$test)
helis$PaperClip2 <- factor(ifelse(helis$PaperClip=='No','No',as.character(helis$test)),    levels=c('No','WH','RH'))
library(R2jags)
datain <- list(
    PaperType=c(1:4)[helis$PaperType],
    WingLength=helis$WingLength,
    BodyLength=helis$BodyLength,
    BodyWidth=helis$BodyWidth,
    PaperClip=c(1,2,3)[helis$PaperClip2],
    Fold=c(0,1)[helis$Fold],
    TapedBody=c(0,1)[helis$TapedBody],
    TapedWing=c(0,1)[helis$TapedWing],
    test=c(1,2)[helis$test],
    Time=helis$Time,
    n=nrow(helis))
parameters <- c('Mul','WL','BL','PT','BW','PC','FO','TB','TW','StDev','residual')

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          FO*Fold[i]+
          TB*TapedBody[i]+
          TW*TapedWing[i]
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
  }
  for (i in 1:4) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01)
  BL ~dnorm(0,1000)
  BW ~dnorm(0,1000)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01)
  
  FO ~dnorm(0,1000)
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)

  Mul ~ dnorm(1,1) %_% I(0,2)
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=4,
    n.iter=3000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,4),
          BW=0,PC=c(NA,0,0),FO=0,TB=0,TW=0))
jj
#################################
datain <- list(
    PaperType=c(2,1,3,1)[helis$PaperType],
    WingLength=helis$WingLength,
    BodyLength=helis$BodyLength,
    BodyWidth=helis$BodyWidth,
    PaperClip=c(1,2,3)[helis$PaperClip2],
    TapedBody=c(0,1)[helis$TapedBody],
    TapedWing=c(0,1)[helis$TapedWing],
    test=c(1,2)[helis$test],
    Time=helis$Time,
    n=nrow(helis))

parameters <- c('Mul','WL','BL','PT','BW','PC','TB','TW','StDev','residual',
    'WLBW','WLPC',            'WLWL',
    'BLPT'       ,'BLPC',     'BLBL',
    'BWPC',                   'BWBW',  'other')

jmodel <- function() {
  for (i in 1:n) {     
    premul[i] <- (test[i]==1)+Mul*(test[i]==2)
    mu[i] <- premul[i] * (
          WL*WingLength[i]+
          BL*BodyLength[i] + 
          PT[PaperType[i]] +
          BW*BodyWidth[i] +
          PC[PaperClip[i]] +
          TB*TapedBody[i]+
          TW*TapedWing[i]+
          
          WLBW*WingLength[i]*BodyWidth[i]+
          WLPC[1]*WingLength[i]*(PaperClip[i]==2)+
          WLPC[2]*WingLength[i]*(PaperClip[i]==3)+
          
          BLPT[1]*BodyLength[i]*(PaperType[i]==2)+
          BLPT[2]*BodyLength[i]*(PaperType[i]==3)+
          BLPC[1]*BodyLength[i]*(PaperClip[i]==2)+
          BLPC[2]*BodyLength[i]*(PaperClip[i]==3)+
          
          BWPC[1]*BodyWidth[i]*(PaperClip[i]==2)+
          BWPC[2]*BodyWidth[i]*(PaperClip[i]==3) +
          
          WLWL*WingLength[i]*WingLength[i]+
          BLBL*BodyLength[i]*BodyLength[i]+
          BWBW*BodyWidth[i]*BodyWidth[i]
          
          
          )
    Time[i] ~ dnorm(mu[i],tau[test[i]])
    residual[i] <- Time[i]-mu[i]
  }
  for (i in 1:2) {
    tau[i] <- pow(StDev[i],-2)
    StDev[i] ~dunif(0,3)
    WLPC[i] ~dnorm(0,1)
    BLPT[i] ~dnorm(0,1)
    BLPC[i] ~dnorm(0,1) 
    BWPC[i] ~dnorm(0,1)      
  }
  for (i in 1:3) {
    PT[i] ~ dnorm(PTM,tauPT)
  }
  tauPT <- pow(sdPT,-2)
  sdPT ~dunif(0,3)
  PTM ~dnorm(0,0.01)
  WL ~dnorm(0,0.01) 
  BL ~dnorm(0,0.01)
  BW ~dnorm(0,0.01)
  PC[1] <- 0
  PC[2]~dnorm(0,0.01)
  PC[3]~dnorm(0,0.01) 
  TB ~dnorm(0,0.01)
  TW ~dnorm(0,0.01)
  
  WLBW~dnorm(0,1)
  WLTW~dnorm(0,1)
  
  WLWL~dnorm(0,1)
  BLBL~dnorm(0,1) 
  BWBW~dnorm(0,1)
  
  other~dnorm(0,1)
  Mul ~ dnorm(1,1) %_% I(0,2)
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    progress.bar='gui',
    n.chain=5,
    n.iter=4000,
    inits=function() list(Mul=1.3,WL=0.15,BL=-.08,PT=rep(1,3),
          PC=c(NA,0,0),TB=0,TW=0))
jj

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Learning about classes in R with plot.bike()

Sat, 2015-05-16 20:00

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

A useful feature of R is its ability to implement a function differently depending on the ‘class’ of the object acted on. This article explores this behaviour with reference to a playful modification of the ‘generic’ function plot() to allow plotting of cartoon bicycles. Although the example is quite simple and fun, the concepts it touches on are complex and serious.

The example demonstrates several of the programming language paradigms that R operates under. R is simultaneously object-orientated, functional and polymorphic. The example also demonstrates the paradigm of inheritance, through the passing of arguments from plot.bike() to plot() via the ... symbol. There has been much written about programming paradigms and R’s adherence to (or flouting of!) them. Two useful references on the subject are a Wikibook page on programming language paradigms and Hadley Wickham's Advanced R book. There is a huge amount of information on these topics. For the purposes of the examples presented here suffice to say that R uses multiple paradigms and is extremely flexible.

Context: an advanced R course

The behaviour of ‘generic functions’ such as plot was taught during a 2 day course by Colin Gillespie in Newcastle. In it we learned about some of the nuts and bolts that underlie R’s uniquely flexible and sometimes bizarre syntax. Environments, start-up, functions and classes were some of the topics covered. These and more issues are described in multiple places on-line and in R’s own documentation, and neatly synthesised in Hadley Wickham’s penultimate book, Advanced R. However, nothing beats face-to-face learning and I learned plenty about R’s innards during the course, despite having read around the topics covered previously.

Colin has made his materials available on-line on github for the benefit of people worldwide. http://rcourses.github.io/ contains links to pages which introduce a number of courses which can, to a large extent, be conducted from the safety of one’s home. There are also R packages for each of the courses. The package for the Advanced R course, for example, can be installed with the following code:
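The install snippet itself did not survive the scrape of this post, so here is a minimal sketch using devtools; the repository path is my assumption based on the rcourses GitHub organisation mentioned above and may differ from the actual one:

install.packages("devtools")                     # if devtools is not already installed
devtools::install_github("rcourses/nclRadvanced") # assumed repository path for the course package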

Once the package has been installed and loaded, with library(nclRadvanced), a number of vignettes and solutions sheets can be accessed, e.g. via:

vignette(package = "nclRadvanced")
vignette(package = "nclRadvanced", "practical2")

Creating a new S3 class for bikes

The S3 class system is very flexible. Any object can be allocated to a class of any name, without restriction. S3 classes only become meaningful when objects allocated to a particular class are passed to a function that recognises classes. Functions that behave differently depending on the class of the object they act on are known as generic.

We can find out the class type of an object using the pryr package. A good example of the S3 object type is “lm”, which plots in a different way thanks to plot.lm(), which dispatches the plot() command differently for objects within the lm S3 object class.

x <- 1:9
y <- x^2
m <- lm(y ~ x)
class(m)
## [1] "lm"
pryr::otype(m) # requires pryr to be installed
## [1] "S3"

Note that the object system is flexible, so any class name can be allocated to any object, such as class(x) <- "lm". Note that if we enter this, plot(x) will try to dispatch x to plot.lm() and fail.

Classes only become useful when they have a series of generic methods associated with them. We will illustrate this by defining a list as a ‘bike’ object and creating a plot.bike(), a class-specific method of the generic plot function for plotting S3 objects of that class. Let’s define the key components of a bike:
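The original definition did not survive the scrape, so this is a rough stand-in of my own; ws (wheel size) is the only component the post refers to later, and the other components are invented for illustration:

x <- list(ws = 700,            # wheel size, used by the plot method below
          frame = "diamond",   # invented component
          gears = 18)          # invented component
class(x) <- "bike"
class(x)
## [1] "bike"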

Note that there are no strict rules. We could allocate the class to any object, and we could replace bike with almost any name. The S4 class system, used in spatial data for example, is much stricter.

The bike class becomes useful when it comes to method dispatch, such as plotting.

Creating a plot method for bikes

Suppose that every bike object has the same components as those contained in the object x created above. We can specify how it should be plotted as follows:
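The original plot.bike() did not survive the scrape either; the sketch below is my own rough stand-in, which only uses the ws component and whose cartoon geometry is entirely invented:

plot.bike <- function(x, ...) {
  # set up an empty canvas; extra arguments are passed on to plot() via ...
  plot(0:3, 0:3, type = "n", asp = 1, xlab = "", ylab = "", ...)
  ws <- x$ws / 1000                                   # wheel radius in plot units
  symbols(c(0.5, 2.5), c(0.5, 0.5), circles = rep(ws, 2),
          add = TRUE, inches = FALSE)                 # two wheels
  segments(c(0.5, 1.5, 1.0, 2.5), c(0.5, 0.5, 1.5, 0.5),
           c(1.5, 1.0, 2.5, 2.3), c(0.5, 1.5, 0.5, 1.5))  # a crude frame and fork
}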

Now that a new method has been added to the generic plot() function, the fun begins. Any object assigned to the class ‘bike’ will now automatically be dispatched to plot.bike() when plot() is called.

And, as the plots below show, a plot of a bicycle is produced.

plot(x)

Try playing with the wheel size - some bikes with quite strange dimensions can be produced!

x$ws <- 1500 # a bike with large wheels
plot(x)

x$ws <- 150 # a bike with small wheels
plot(x)

It would be interesting to see how the dimensions of the last bicycle compare with a Brompton!

Discussion

The bike class demonstrates that the power of S3 classes lies not in the class's objects but in the generic functions which take on new methods. It is precisely this behaviour which makes the class family of Spatial* objects defined by the sp package so powerful. sp adds new methods for plot(), aggregate() and even the subsetting function "[".

This can be seen by calling methods() before and after sp is loaded:

methods(aggregate)
## [1] aggregate.data.frame aggregate.default*   aggregate.formula*
## [4] aggregate.ts
## see '?methods' for accessing help and source code
library(sp) # load the sp library, which creates new methods
methods(aggregate) # the new method is now shown
## [1] aggregate.data.frame aggregate.default*   aggregate.formula*
## [4] aggregate.Spatial*   aggregate.ts
## see '?methods' for accessing help and source code

Note that Spatial classes are different from the bike class because they use the S4 class system. We will be covering the nature and behaviour of Spatial objects in the “Spatial data analysis with R” course in Newcastle, 2nd - 3rd June, which is still open for registration.

The bike class is not ‘production’ ready but there is no reason why someone who understands bicycles inside out could not create a well-defined (perhaps S4) class for a bicycle, with all the essential dimensions defined. This could really be useful, including in efforts at making R more useful for transport planning, such as my package under development to provide tools for transportation research and analysis, stplanr.

Having learned about classes, I’m wondering whether origin-destination ‘flow’ data, used in stplanr, would benefit from its own class, or if its current definition as SpatialLinesDataFrame is sufficient. Any ideas welcome!

Conclusion

Classes are an advanced topic in R that usually just 'works'. However, if you want to modify existing functions to behave differently on new object types, understanding how to create classes and class-specific methods can be very useful. The example of the bike class created above is not intended for production, but provides a glimpse into what is possible. At the very least, this article should help provide readers with new insight into the inner workings of R and its impressive functional flexibility.

Post script

If you are interested in using R for transport research, please check out my under-development package stplanr and let me know via GitHub of any features you’d like it to have before submission to CRAN and rOpenSci:

https://github.com/Robinlovelace/stplanr

Or tweet me on @robinlovelace

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R Recipe: Reordering Columns in a Flexible Way

Sat, 2015-05-16 02:20

(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

Suppose you have a data frame with a number of columns.

> names(trading)
 [1] "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"         "TP"         "OpenPrice" 
 [9] "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"       "Duration"   "Trader"    
[17] "System"

You want to put the Trader and System columns first but you also want to do this in a flexible way. One approach would be to specify column numbers.

> trading = trading[, c(16:17, 1:15)]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

This does the job but it's not very flexible. After all, the number of columns might change. Rather do it by specifying column names.

> refcols <- c("Trader", "System")
> #
> trading <- trading[, c(refcols, setdiff(names(trading), refcols))]
> names(trading)
 [1] "Trader"     "System"     "OpenDate"   "CloseDate"  "Symbol"     "Action"     "Lots"       "SL"        
 [9] "TP"         "OpenPrice"  "ClosePrice" "Commission" "Swap"       "Pips"       "Profit"     "Gain"      
[17] "Duration"

The post R Recipe: Reordering Columns in a Flexible Way appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on his blog: Exegetic Analytics » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

How to correctly set color in the image() function?

Fri, 2015-05-15 16:10

(This article was first published on One Tip Per Day, and kindly contributed to R-bloggers)

Sometimes we want to make our own heatmap using the image() function. I recently found it tricky to set the color option there, as its manual has very little information on col:


col: a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.
I posted my question on BioStars. The short answer is: unless breaks is set, the range of Z is evenly cut into N intervals (where N is the length of col) and the values in Z are assigned to the color of the corresponding interval. For example, when x=c(3,1,2,1) and col=c("blue","red",'green','yellow'), the minimum of x is assigned the first color and the maximum the last color. Any value in between is mapped proportionally to a color. In this case, 2 is the middle one; following the principle that intervals are closed on the right and open on the left, it is assigned to "red". That is why we see the colors yellow-->blue-->red-->blue.
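As a quick self-contained illustration of this mapping (my own toy example, not from the original post):

x <- matrix(c(3, 1, 2, 1), nrow = 1)
image(x, col = c("blue", "red", "green", "yellow"))
# 1 falls in the first interval (blue), 2 in the second (red),
# 3 in the last (yellow); green is never used for these values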

In practice, unless we want to manually define the color break points, we just set the first and last color, and the values in Z will be assigned colors automatically.

collist<-c(0,1)
image(1:ncol(x),1:nrow(x), as.matrix(t(x)), col=collist, asp=1)

If we want to manually define the color break points, we need to do something like the following:

x=matrix(rnorm(100),nrow=10)*100
xmin=0; xmax=100;
x[x<xmin]=xmin; x[x>xmax]=xmax;
collist<-c("#053061","#2166AC","#4393C3","#92C5DE","#D1E5F0","#F7F7F7","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
ColorRamp<-colorRampPalette(collist)(10000)
ColorLevels<-seq(from=xmin, to=xmax, length=10000)
ColorRamp_ex <- ColorRamp[round(1+(min(x)-xmin)*10000/(xmax-xmin)) : round( (max(x)-xmin)*10000/(xmax-xmin) )]
par(mar=c(2,0,2,0), oma=c(3,3,3,3))
layout(matrix(seq(2),nrow=2,ncol=1),widths=c(1),heights=c(3,0.5))
image(t(as.matrix(x)), col=ColorRamp_ex, las=1, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
image(as.matrix(ColorLevels),col=ColorRamp, xlab="",ylab="",cex.axis=1,xaxt="n",yaxt="n")
axis(1,seq(xmin,xmax,10),seq(xmin,xmax,10))

To leave a comment for the author, please follow the link and comment on his blog: One Tip Per Day. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Because it’s Friday: Love in the land of Facebook

Fri, 2015-05-15 15:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Today is my 11th wedding anniversary with my wonderful husband Jay, so it's a love-themed Friday post today. Jay and I met before Facebook was a thing, but we've been touched by the congratulations on our timelines today.

Those timeline posts reveal a lot about you and your relationships, and last year the Facebook data science team published a series of analyses with anonymized data (and done with R, of course) on what your Facebook activity says about your love life. For example, the way to tell if two people are destined to be a couple (as marked by a joint "In a Relationship" status on Facebook) is to watch for a steadily increasing rate of shared timeline posts:

After the relationship is declared the rate of joint posts drops off dramatically (as, presumably, the couple is otherwise occupied), but those posts they do share become sweetly sentimental:

Read about other Facebook analyses of relationship status in the 6-part series.

That's all for this week. Have a great weekend, and we'll see you back here on Monday!

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

From JSON to Tables

Fri, 2015-05-15 14:49

(This article was first published on My Data Atelier » R, and kindly contributed to R-bloggers)

“First things first”. Tautological, always true. However, sometimes some data scientists seem to ignore this: you can think of using the most sophisticated and trendy algorithm, come up with brilliant ideas, imagine the most creative visualizations, but if you do not know how to get the data and handle it in the exact way you need it, all of this becomes worthless. In other words, first things first.

In my professional experience, I have heard thousands of potentially brilliant ideas which couldn't even be tested because the data scientist in question did not know how to handle data according to his/her needs. This has become particularly problematic with the popularisation of JSON: despite the undeniable advantages that this data structure has in terms of data storage, replication, etc., it presents a challenge for data scientists, as most algorithms require that the input data is passed in tabular form. I have faced this problem myself and, after creating a couple of temporary solutions, felt it was high time to come up with a general one. Of course, the solution I have come up with is just one approach.

In order to understand it, two things about JSON's structure must be considered. Firstly, JSON objects are basically an array of objects. This means that, unlike traditional tabulated databases, they carry no indication of what they don't have. This might seem trivial, but if you want to tabulate JSONs it certainly becomes something to be solved. Secondly, this array of objects can contain objects which are arrays themselves. This means that, when transforming this to a table, such a value has to be converted somehow to a format appropriate for a table. The function in question contemplates these two scenarios: it is intended to be used not only to tabulate objects with identical structure but also objects whose internal structure differs.

Before diving into the code of the function, let us propose a scenario where this could be useful. Imagine we would like to tabulate certain information about professional football players. In this mini-example, we have chosen Lionel Messi, Cristiano Ronaldo and George Best. The information is retrieved from dbpedia, which exposes the corresponding JSONs of their Wikipedia entries. In the code below, the first steps:

library(rjson)
library(RCurl)

players <- c("Lionel_Messi","George_Best","Cristiano_Ronaldo")
players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))
players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])
players.unlist <- unlist(players.info)

I have to admit that this is not the simplest example, as the JSONs retrieved are extremely complex and need much more pre-processing than usual (more than most API calls, for example). The problem is that nowadays most API calls need an API key, and so the example would become less reproducible.

Back to the example: firstly, the data for each player is retrieved using getURL() from RCurl and then converted to a list using fromJSON() from rjson, which converts JSON strings or files to lists. The whole call is done inside an lapply statement, so the object returned is basically a list of JSONs converted to lists, i.e., a list of lists. After that, from each of the sub-lists (in this case three, as there were three players) we select only the piece of information that refers to the player himself; in a few words, a bit of cleanup. Finally, the object is unlisted.
unlist() is a function which turns a list (a list of lists is itself a list) into a named vector of whatever type all the elements can be coerced to; in this case, character.

This is the point of manual intervention, where the variable selection has to be done. At this stage, it is highly recommended to have at least one example loaded in some JSON parser (JSON Viewer, or any browser’s JSONView add-in). The names of the vector generated by unlist() (“players.unlist” in the example) are a concatenation of the names of the lists that value was in, separated by “.”.
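As a tiny illustration of this naming scheme (my own example, not from the original post):

unlist(list(player = list(name = "Best", goals = list(club = 137))))
# a named character vector with names "player.name" and "player.goals.club"
# and values "Best" and "137" (everything is coerced to character)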

The elements in the character vector are ordered in the same way they appeared in the lists. So, the first thing to do is to establish where each of the cases starts. In order to do so, the challenge is to find a common pattern between the vector names which can identify where each of the elements starts. This element does not necessarily have to end up inside the final table. In this case I chose "surname\.value$". However, I would recommend, in most cases, choosing the first element that appears in the object.

Similarly, the challenge now is to find common patterns in the names of the vector elements that will define each of the variables. In order to do that, the wisest thing is to take a look at the names that you have. In this case:

> unique(names(players.unlist))
 [1] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.type"
 [2] "http://www.w3.org/1999/02/22-rdf-syntax-ns#type.value"
 [3] "http://www.w3.org/2002/07/owl#sameAs.type"
 [4] "http://www.w3.org/2002/07/owl#sameAs.value"
 [5] "http://www.w3.org/2000/01/rdf-schema#label.type"
 [6] "http://www.w3.org/2000/01/rdf-schema#label.value"
 [7] "http://www.w3.org/2000/01/rdf-schema#label.lang"
 [8] "http://purl.org/dc/terms/subject.type"
 [9] "http://purl.org/dc/terms/subject.value"
[10] "http://xmlns.com/foaf/0.1/homepage.type"
[11] "http://xmlns.com/foaf/0.1/homepage.value"
[12] "http://xmlns.com/foaf/0.1/depiction.type"
[13] "http://xmlns.com/foaf/0.1/depiction.value"
[14] "http://purl.org/dc/elements/1.1/description.type"
[15] "http://purl.org/dc/elements/1.1/description.value"
[16] "http://purl.org/dc/elements/1.1/description.lang"
[17] "http://xmlns.com/foaf/0.1/givenName.type"
[18] "http://xmlns.com/foaf/0.1/givenName.value"
[19] "http://xmlns.com/foaf/0.1/givenName.lang"
[20] "http://xmlns.com/foaf/0.1/name.type"

By taking a look at the JSON of any of these examples, we can see that "type" refers to the type of value of "value". So, in this case, we know that we are going to need only the names that end with ".value". However, the JSONs are too large to do a complete scan of all their elements, so the wisest thing is to look for the desired value manually. For example, if we would like to take the date of birth:

> grep("(birth|Birth).*\.value",unique(names(players.unlist)),value=T)
[1] "http://dbpedia.org/ontology/birthName.value"
[2] "http://dbpedia.org/property/birthDate.value"
[3] "http://dbpedia.org/property/birthPlace.value"
[4] "http://dbpedia.org/property/dateOfBirth.value"
[5] "http://dbpedia.org/property/placeOfBirth.value"
[6] "http://dbpedia.org/ontology/birthDate.value"
[7] "http://dbpedia.org/ontology/birthPlace.value"
[8] "http://dbpedia.org/ontology/birthYear.value"
[9] "http://dbpedia.org/property/birthName.value"

Now we know that there are several different birth date fields. In this case, we should take a look manually and, based upon that, choose the one that best fits our needs. In this example, I chose 6 arbitrary variables, extracted their patterns and chose suitable variable names for them. Please notice that, for example, date of death should be empty for Messi and Ronaldo, while number is empty in the case of George Best (players didn't use to have a fixed number in the past). Apart from that, "goals" has multiple entries per player, as the JSON has one value per club. This is the end of the script:

st.obj <- "^.+surname\.value$"
columns <- c("fullname\.value","ontology/height\.value",
    "dateOfBirth\.value","dateOfDeath\.value",
    "ontology/number\.value","property/goals.value$")
colnames <- c("full.name","height","date.of.birth","date.of.death","number","goals")
players.table <- tabulateJSON(players.unlist,st.obj,columns,colnames)

And this is the final result

> players.table
     full.name                             height  date.of.birth      date.of.death     
[1,] "Lionel Andrés Messi"                 "1.69"  "1987-06-24+02:00" NA                
[2,] "George Best"                         "1.524" "1946-05-22+02:00" "2005-11-25+02:00"
[3,] "Cristiano Ronaldo dos Santos Aveiro" "1.85"  "1985-02-05+02:00" NA                
     number goals                        
[1,] "10"   "6;5;242"                    
[2,] NA     "6;2;3;0;1;15;12;8;33;137;21"
[3,] "7"    "3;84;176"

Finally, we get to the function. tabulateJSON() expects four parameters: an unlisted JSON (or any character vector with similar characteristics to one produced by unlisting a JSON), a string that identifies the starting position of each element, a vector of characters (normally regex patterns) with the element names to be sought, and finally the names to assign to the columns generated.

Now let's take a look at how tabulateJSON() works. Below is the entire code of the function:

library(plyr) # needed for ddply() and summarise, used to collapse multiple matches

tabulateJSON <- function (json.un, start.obj, columns, colnames) {
  if (length(columns) != length(colnames)) {
    stop("'columns' and 'colnames' must be the same length")
  }
  start.ind <- grep(start.obj, names(json.un))
  col.indexes <- lapply(columns, grep, names(json.un))
  col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
  temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]],
      ind = json.un[col.indexes[[x]]], stringsAsFactors = F))
  collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
  if(length(collapse.cols) > 0){
    temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols],
        function(x) ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
  }
  matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
  matr$pos <- NULL
  names(matr) <- colnames
  matr <- as.matrix(matr)
  colnames(matr) <- colnames
  return(matr)
}

How does it work? Firstly, it looks for the items that define the start of an object, based upon the pattern passed, and returns their indexes. These will be the delimiters.

start.ind <- grep(start.obj, names(json.un))

After that, it will look for the names that match the patterns passed in "columns" and return their indexes. Those indexes have to be assigned to a position (i.e., a row number), which is done with findInterval(). This function maps a number to an interval given cut points.
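A one-line illustration of findInterval() (my addition, not from the post):

findInterval(c(2, 5, 9), c(1, 4, 8))
# [1] 1 2 3  -- each value is assigned to the interval (here, the row) it falls into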

For each of the variables, a data frame with two columns is created, where the first one is the position and the second one the value itself, obtained by accessing the values by the column indexes in the character vector. As will be seen later, this is done because a sought pattern (which is converted to a variable) can have more than one match in a row. Of course, that becomes problematic when trying to generate a tabulated dataset. For that reason, it is possible that temp.frames contains data frames with different numbers of rows.

col.indexes <- lapply(columns, grep, names(json.un))
col.position <- lapply(1:length(columns), function(x) findInterval(col.indexes[[x]], start.ind))
temp.frames <- lapply(1:length(columns), function(x) data.frame(pos = col.position[[x]],
    ind = json.un[col.indexes[[x]]], stringsAsFactors = F))

After this, it is necessary to collapse those variables which contain multiple values per row. Firstly, the function checks whether any of the elements in the list has more rows than the number of delimiters (which defines the number of rows). If there is any, it uses plyr's ddply() to collapse by the row specifier (pos). After this process all the data frames in temp.frames will be of equal length:

collapse.cols <- which(sapply(temp.frames, nrow) > length(start.ind))
if(length(collapse.cols) > 0){
  temp.frames[collapse.cols] <- lapply(temp.frames[collapse.cols],
      function(x) ddply(.data = x, .(pos), summarise, value = paste0(ind, collapse = ";")))
}

Finally, all the elements in temp.frames are merged with one another, the column names are assigned and "pos" is dropped, as it does not belong to the dataset to be produced:

matr <- Reduce(function(...) merge(...,all=T,by="pos"),temp.frames)
matr$pos <- NULL
names(matr) <- colnames
matr <- as.matrix(matr)
colnames(matr) <- colnames
return(matr)

Of course, using this function directly is a bit "uncomfortable". For that reason, and depending on your particular needs, you can include tabulateJSON() as part of a higher-level function. In this particular example, the function could receive the names of the players and a list of high-level variables, which will be mapped to particular patterns, for example:

getPlayerInfo <- function(players,variables){
  players.info <- lapply(players, function(pl) fromJSON(getURL(paste0("http://dbpedia.org/data/",pl,".json"))))
  players.info <- lapply(1:length(players), function(x) players.info[[x]][[paste0("http://dbpedia.org/resource/",players[x])]])
  players.unlist <- unlist(players.info)
  st.obj <- "^.+surname\.value$"
  columns.to.grep <- paste0(variables,"\.value$")
  #Check if there is a multiple match with different types
  col.grep <- lapply(columns.to.grep, grep, x=unique(names(players.unlist)))
  columns <- sapply(col.grep, function(x) unique(names(players.unlist))[x[1]])
  #Convert names to a regex
  columns <- gsub("\.","\\\.",columns)
  columns <- paste0("^",columns,"$")
  players.table <- tabulateJSON(players.unlist,st.obj,columns,variables)
  return(players.table)
}

So, the call will be:

getPlayerInfo(c("Lionel_Messi","David_Beckham","Zinedine_Zidane","George_Best","George_Weah"),
    c("fullname","height","dateOfBirth","dateOfDeath","number","goals"))
     fullname                                  height   dateOfBirth       
[1,] "Lionel Andrés Messi"                     "1.69"   "1987-06-24+02:00"
[2,] "David Robert Joseph Beckham"             "1.8288" "1975-05-02+02:00"
[3,] "Zinedine Yazid Zidane"                   "1.85"   "1972-06-23+02:00"
[4,] "George Best"                             "1.524"  "1946-05-22+02:00"
[5,] "George Tawlon Manneh;Oppong Ousman Weah" "1.84"   "1966-10-01+02:00"
     dateOfDeath        number goals                        
[1,] NA                 "10"   "6;5;242"                    
[2,] NA                 NA     "62;2;0;13;18"               
[3,] NA                 NA     "6;37;28;24"                 
[4,] "2005-11-25+02:00" NA     "6;2;3;0;1;15;12;8;33;137;21"
[5,] NA                 NA     "46;47;32;24;14;13;7;5;3;1"

I hope you enjoyed it and/or found it useful at least ;)

As usual, if you have any comments, suggestions or criticism, please drop me a line.


To leave a comment for the author, please follow the link and comment on his blog: My Data Atelier » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

In-database R coming to SQL Server 2016

Fri, 2015-05-15 13:03

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is coming to SQL Server. SQL Server 2016 (which will be in public preview this summer) will include new real-time analytics, automatic data encryption, and the ability to run R within the database itself:

For deeper insights into data, SQL Server 2016 expands its scope beyond transaction processing, data warehousing and business intelligence to deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics.  We want to make advanced analytics more accessible and increase performance for your advanced analytic workloads by bringing R processing closer to the data and building advanced analytic capabilities right into SQL Server.  Additionally, we are building PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills. With this wave, you can then gain faster insights through rich visualizations on many devices including mobile applications on Windows, iOS and Android.

With this update, data scientists will no longer need to extract data from SQL Server via ODBC to analyze it with R. Instead, you will be able to take your R code to the data, where it will be run inside a sandbox process within SQL Server itself. This eliminates the time and storage required to move the data, and gives you all the power of R and CRAN packages to apply to your database.

At last week's Microsoft Ignite conference in Chicago, SQL Server program managers Lindsey Allen and Borko Novakovic demonstrated a prototype of running R within SQL Server. (A description of the integration begins at 57:00, and the demo at 1:05:00, in the video below.)  In the demo, Lindsey applies a Naive Bayes classification model (from the e1071 R package) to the famous Iris data, using the same R code used in this Azure ML Studio experiment.
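For readers who want to try the same kind of model locally, here is a minimal stand-alone sketch with e1071 and the built-in iris data; it is not the actual code from the demo or the Azure ML experiment:

library(e1071)
fit <- naiveBayes(Species ~ ., data = iris)   # Naive Bayes classifier on the Iris data
table(predict(fit, iris), iris$Species)       # confusion matrix on the training data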

 

SQL Server 2016 is the first Microsoft product to integrate Revolution R (and there are more exciting announcements on that front to come -- stay tuned). This also brings R to other Microsoft products via their native SQL Server integration, including Excel and PowerBI. Read more about the features coming to SQL Server 2016, including Revolution R integration, at the link below. 

SQL Server Blog: SQL Server 2016 public preview coming this summer

To leave a comment for the author, please follow the link and comment on his blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Interactive maps for the web in R

Fri, 2015-05-15 12:54

(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

Static Maps
In the last post I showed how to download economic data from the World Bank's website and create choropleth maps in R (Global Economic Maps).
In this post I want to focus more on how to visualize those maps.

Sp Package
Probably the simplest way of plotting choropleth maps in R is the one I showed in the previous post, using the function spplot(). For example, with a call like the following:

library(sp)
spplot(polygons,"CO2",main=paste("CO2 Emissions - Year:",CO2.year),sub="Metric Tons per capita")
This function takes the object polygons, which is a SpatialPolygonsDataFrame, and, in quotation marks, the name of the column where to find the values to assign, as colors, to each polygon. These are the two basic mandatory elements needed to call the function. However, we can add information to the plot with additional elements, such as a title (option main) and a subtitle (option sub).

This function creates the following plot:

The color scale is selected automatically and can be changed with a couple of standard functions or using customized color scales. For more information please refer to this page: http://rstudio-pubs-static.s3.amazonaws.com/7202_3145df2a90a44e6e86a0637bc4264f9f.html


Standard Plot
Another way of plotting maps is by using the standard plot() function, which allows us to increase the flexibility of the plot, for example by customizing the position of the color legend.
The flip side is that, since it is the most basic plotting function available and it does not have many built-in options, this function requires more lines of code to achieve a good result.
Let's take a look at the code below:

library(plotrix)
CO2.dat <- na.omit(polygons$CO2)
colorScale <- color.scale(CO2.dat,color.spec="rgb",extremes=c("red","blue"),alpha=0.8)
 
colors.DF <- data.frame(CO2.dat,colorScale)
colors.DF <- colors.DF[with(colors.DF, order(colors.DF[,1])), ]
colors.DF$ID <- 1:nrow(colors.DF)
breaks <- seq(1,nrow(colors.DF),length.out=10)
 
 
jpeg("CO2_Emissions.jpg",7000,5000,res=300)
plot(polygons,col=colorScale)
title("CO2 Emissions",cex.main=3)
 
legend.pos <- list(x=-28.52392,y=-20.59119)
legendg(legend.pos,legend=c(round(colors.DF[colors.DF$ID %in% round(breaks,0),1],2)),fill=paste(colors.DF[colors.DF$ID %in% round(breaks,0),2]),bty="n",bg=c("white"),y.intersp=0.75,title="Metric tons per capita",cex=0.8)
 
dev.off()

By simply calling the plot function with the object polygons, R is going to create an image of the country borders with no filling. If we want to add colors to the plot we first need to create a color scale using our data. To do so we can use the function color.scale() in the package plotrix, which I also used in the post regarding the visualization of seismic events from USGS (Downloading and Visualizing Seismic Events from USGS ). This function takes a vector, plus the colors of the extremes of the color scale, in this case red and blue, and creates a vector of intermediate colors to assign to each element of the data vector.
In this example I first created a vector named CO2.dat, with the values of CO2 for each polygon, excluding NAs. Then I feed it to the color.scale() function.

The next step is the creation of the legend. First we create the breaks needed to present the full spectrum of colors used in the plot. For this I created a data.frame with values and colors and then subset it into 10 elements, which is the length of the legend.
Now we can submit the rest of the code to create a plot and the legend and save it into the jpeg file below:



Interactive Maps
The maps we created thus far are good for showing our data on papers and allow the reader to have a good understanding of what we are trying to show with them. Clearly these are not the only two methods available to create maps in R, many more are available. In particular ggplot2 now features ways of creating beautiful static maps. For more information please refer to these website and blog posts:
Maps in R
Making Maps in R
Introduction to Spatial Data and ggplot2
Plot maps like a boss
Making Maps with R

In this post, however, I would like to focus on ways to move away from static maps and embrace the fact that we are now connected to the web all the time. This allows us to create maps specifically designed for the web, which are also much easier to read by the general public that is used to them.
These are the interactive maps that we see all over the web, for example from Google. They are created in javascript and are extremely powerful. The problem is, we know R and we work in it all the time, but we do not necessarily know how to code in javascript. So how can we create beautiful interactive maps for the web if we cannot code in javascript and HTML?

Luckily for us, developers have created packages that allow us to create maps using standard R code but in the form of an HTML page that we can upload directly to our website. I will now examine the packages I know and use regularly for plotting choropleth maps.


googleVis
This package is extremely simple to use and yet capable of creating beautiful maps that can be uploaded easily to our website.
Let's look at the code below:

library(googleVis)
data.poly <- as.data.frame(polygons)
data.poly <- data.poly[,c(5,12)]
names(data.poly) <- c("Country Name","CO2 emissions (metric tons per capita)")
 
map <- gvisGeoMap(data=data.poly, locationvar = "Country Name", numvar='CO2 emissions (metric tons per capita)',options=list(width='800px',height='500px',colors="['0x0000ff', '0xff0000']"))
plot(map)
 
print(map,file="Map.html")
 
#http://www.javascripter.net/faq/rgbtohex.htm
#To find HEX codes for RGB colors
The first thing to do is clearly to load the package googleVis. Then we have to transform the SpatialPolygonsDataFrame into a standard data.frame. Since we are interested in plotting only the data related to the CO2 emissions for each country (as far as I know with this package we can plot only one variable for each map), we can subset the data.frame, keeping only the column with the names of each country and the one with the CO2 emissions. Then we need to change the names of these two columns so that the user can readily understand what he is looking at.
Then we can simply use the function gvisGeoMap() to create a choropleth map using the Google Visualisation API. This function does not read the coordinates from the object; instead we provide the names of the geo locations to use with the option locationvar, and the Google Visualisation API will then take the names of the polygons and match them to the geometry of the country. Then we have the option numvar, which takes the name of the column where to find the data for each country. Finally we have options, where we can define the various customizations available in the Google Visualisation API, documented at this link: GoogleVis API Options
In this case I specified the width and height of the map, plus the two color extremes to use for the color scale. 
The result is the plot below:


This is an interactive plot, meaning that if I hover the mouse over a country the map will tell me its name and the amount of CO2 emitted. This map is generated directly from R, but it is all written in HTML and javascript. We can use the function print(), shown in the snippet above, to save the map into an HTML file that can be uploaded as is to the web.
The map above is accessible from this link:  GoogleVis Map


plotGoogleMaps
This is another great package that harnesses the power of Google's APIs to create intuitive and fully interactive web maps. The difference from the previous package is that here we create interactive maps using the Google Maps API, which is basically the one you use when you look up a place on Google Maps.
Again this API uses javascript to create maps and overlays, such as markers and polygons. However, with this package we can use very simple R code and create stunning HTML pages that we can just upload to our websites and share with friends and colleagues.

Let's look at the following code:

library(plotGoogleMaps)
 
polygons.plot <- polygons[,c("CO2","GDP.capita","NAME")]
polygons.plot <- polygons.plot[polygons.plot$NAME!="Antarctica",]
names(polygons.plot) <- c("CO2 emissions (metric tons per capita)","GDP per capita (current US$)","Country Name")
 
#Full Page Map
map <- plotGoogleMaps(polygons.plot,zoom=4,fitBounds=F,filename="Map_GoogleMaps.html",layerName="Economic Data")
 
 
#To add this to an existing HTML page
map <- plotGoogleMaps(polygons.plot,zoom=2,fitBounds=F,filename="Map_GoogleMaps_small.html",layerName="Economic Data",map="GoogleMap",mapCanvas="Map",map.width="800px",map.height="600px",control.width="200px",control.height="600px")
Again there is a bit of data preparation to do. We need to subset the polygons dataset and keep only the variables we need for the plot (with this package this is not mandatory, but it helps to avoid large objects). Then we have to exclude Antarctica, otherwise the interactive map will have some problems (you can try leaving it in to see what happens and maybe figure out a way to solve it). Finally we change the names of the columns to be more informative.

At this point we can use the function plotGoogleMaps() to create web maps in javascript. This function is extremely simple to use: it takes just one required argument, the spatial object, and creates a web map (R opens the browser to show the output). There are of course ways to customize the output, for example by choosing a zoom level (in this case the fitBounds option needs to be set to FALSE). We can also set a layerName to show in the legend, which is created automatically by the function.
Finally, because we want to create an HTML file to upload to our website, we can use the option filename to save it.
The result is a full screen map like the one below:


This map is available here: GoogleMaps FullScreen


With this function we also have ways to customize not only the map itself but also the HTML page so that we can later add information to it. In the last line of the code snippet above you can see that I added the following options to the function plotGoogleMaps():

mapCanvas="Map",map.width="800px",map.height="600px",control.width="200px",control.height="600px"

These options modify the appearance of the map on the web page, for example its width and height, and the appearance of the legend and controls, via control.width and control.height. With mapCanvas we can also set the id of the HTML <div> element that will contain the final map.
If we have some basic experience with HTML we can then open the file and tweak it a bit, for example by shifting the map and legend to the center and adding a title and some more information.


This map is available here: GoogleMaps Small

The full code to replicate this experiment is presented below:

#Methods to Plot Choropleth Maps in R
load(url("http://www.fabioveronesi.net/Blog/polygons.RData"))

#Standard method
#SP PACKAGE
library(sp)
spplot(polygons,"CO2",main=paste("CO2 Emissions - Year:",CO2.year),sub="Metric Tons per capita")


#PLOT METHOD
library(plotrix)
CO2.dat <- na.omit(polygons$CO2)
colorScale <- color.scale(CO2.dat,color.spec="rgb",extremes=c("red","blue"),alpha=0.8)

colors.DF <- data.frame(CO2.dat,colorScale)
colors.DF <- colors.DF[with(colors.DF, order(colors.DF[,1])), ]
colors.DF$ID <- 1:nrow(colors.DF)
breaks <- seq(1,nrow(colors.DF),length.out=10)


jpeg("CO2_Emissions.jpg",7000,5000,res=300)
plot(polygons,col=colorScale)
title("CO2 Emissions",cex.main=3)

legend.pos <- list(x=-28.52392,y=-20.59119)
legendg(legend.pos,legend=c(round(colors.DF[colors.DF$ID %in% round(breaks,0),1],2)),fill=paste(colors.DF[colors.DF$ID %in% round(breaks,0),2]),bty="n",bg=c("white"),y.intersp=0.75,title="Metric tons per capita",cex=0.8)

dev.off()





#INTERACTIVE MAPS
#googleVis PACKAGE
library(googleVis)

data.poly <- as.data.frame(polygons)
data.poly <- data.poly[,c(5,12)]
names(data.poly) <- c("Country Name","CO2 emissions (metric tons per capita)")

map <- gvisGeoMap(data=data.poly, locationvar = "Country Name", numvar='CO2 emissions (metric tons per capita)',options=list(width='800px',height='500px',colors="['0x0000ff', '0xff0000']"))
plot(map)

print(map,file="Map.html")

#http://www.javascripter.net/faq/rgbtohex.htm
#To find HEX codes for RGB colors





#plotGoogleMaps
library(plotGoogleMaps)

polygons.plot <- polygons[,c("CO2","GDP.capita","NAME")]
polygons.plot <- polygons.plot[polygons.plot$NAME!="Antarctica",]
names(polygons.plot) <- c("CO2 emissions (metric tons per capita)","GDP per capita (current US$)","Country Name")

#Full Page Map
map <- plotGoogleMaps(polygons.plot,zoom=4,fitBounds=F,filename="Map_GoogleMaps.html",layerName="Economic Data")


#To add this to an existing HTML page
map <- plotGoogleMaps(polygons.plot,zoom=2,fitBounds=F,filename="Map_GoogleMaps_small.html",layerName="Economic Data",map="GoogleMap",mapCanvas="Map",map.width="800px",map.height="600px",control.width="200px",control.height="600px")



R code snippets created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.

Categories: Methodology Blogs

U.S. Drought Monitoring With Hexbin State Maps in R

Fri, 2015-05-15 08:18

(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

On the news, today, of the early stages of drought hitting the U.S. northeast states I decided to springboard off of yesterday’s post and show a more practical use of hexbin state maps than the built-in (and still purpose unknown to me) “bees” data.

The U.S. Drought Monitor site supplies more than just a pretty county-level map. There’s plenty of data and you can dynamically retrieve just data tables for the whole U.S., U.S. states and U.S. counties. Since we’re working with state hexbins, we just need the state-level data. Drought levels for all five stages are reported per-state, so we can take all this data and create a faceted/small-multiples map based on it.

This builds quite a bit on the previous work, so you’ll see some familiar code. Most of the new code is actually making the map look nice (the great part about this is that once you have the idiom down, it’s just a matter of running the script each day vs a billion mouse clicks). The other bit of new code is the data-retrieval component:

library(readr)
library(tidyr)
library(dplyr)

intensity <- c(D0="Abnormally Dry", D1="Moderate Drought", D2="Severe Drought",
               D3="Extreme Drought", D4="Exceptional Drought")

today <- format(Sys.Date(), "%Y%m%d")

read_csv(sprintf("http://droughtmonitor.unl.edu/USDMStatistics.ashx/?mode=table&aoi=state&date=%s", today)) %>%
  gather(drought_level, value, D0, D1, D2, D3, D4) %>%
  mutate(intensity=factor(intensity[drought_level], levels=as.character(intensity), ordered=TRUE)) -> drought

This:

  • sets up a fast way to add the prettier description of the drought levels (besides D0, D1, etc)
  • dynamically uses today’s date as the parameter for the URL we read with read_csv (from the readr package)
  • converts the data from wide to long
  • adds the intensity description

The ggplot code will facet on the intensity level to make the overall map:

library(rgdal)
library(rgeos)
library(ggplot2)
library(readr)
library(tidyr)
library(dplyr)
library(grid)

us <- readOGR("us_states_hexgrid.geojson", "OGRGeoJSON")

centers <- cbind.data.frame(data.frame(gCentroid(us, byid=TRUE), id=us@data$iso3166_2))

us_map <- fortify(us, region="iso3166_2")

intensity <- c(D0="Abnormally Dry", D1="Moderate Drought", D2="Severe Drought",
               D3="Extreme Drought", D4="Exceptional Drought")

today <- format(Sys.Date(), "%Y%m%d")

read_csv(sprintf("http://droughtmonitor.unl.edu/USDMStatistics.ashx/?mode=table&aoi=state&date=%s", today)) %>%
  gather(drought_level, value, D0, D1, D2, D3, D4) %>%
  mutate(intensity=factor(intensity[drought_level], levels=as.character(intensity), ordered=TRUE)) -> drought

gg <- ggplot()
gg <- gg + geom_map(data=us_map, map=us_map,
                    aes(x=long, y=lat, map_id=id),
                    color="white", size=0.5)
gg <- gg + geom_map(data=drought, map=us_map,
                    aes(fill=value, map_id=State))
gg <- gg + geom_map(data=drought, map=us_map,
                    aes(map_id=State),
                    fill="#ffffff", alpha=0, color="white", show_guide=FALSE)
gg <- gg + geom_text(data=centers, aes(label=id, x=x, y=y), color="white", size=4)
gg <- gg + scale_fill_distiller(name="State\nDrought\nCoverage", palette="RdPu",
                                na.value="#7f7f7f",
                                labels=sprintf("%d%%", c(0, 25, 50, 75, 100)))
gg <- gg + coord_map()
gg <- gg + facet_wrap(~intensity, ncol=2)
gg <- gg + labs(x=NULL, y=NULL,
                title=sprintf("U.S. Drought Conditions as of %s\n", Sys.Date()))
gg <- gg + theme_bw()
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0, size=24))
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(panel.margin=unit(3, "lines"))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(axis.ticks=element_blank())
gg <- gg + theme(axis.text=element_blank())
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme(strip.text=element_text(face="bold", hjust=0, size=14))
gg <- gg + theme(legend.position=c(0.75, 0.15))
gg <- gg + theme(legend.direction="horizontal")
gg <- gg + theme(legend.title.align=1)

png(sprintf("%s.png", today), width=800, height=800)
print(gg)
dev.off()

Now, you can easily animate these over time to show the progression/regression of the drought conditions. If you’re sure your audience can work with SVG files, you can use those for very crisp/sharp maps (and even feed it to D3 or path editing tools). If you have an example of how you’re using hexbin choropleths, drop a note in the comments. The code from above is also on github.
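As a concrete illustration of the animation idea, here is a minimal sketch of my own (not from the original post; the weekly date sequence and the ImageMagick command are assumptions) that renders one PNG per report date and leaves assembling the frames to an external tool:

dates <- format(seq(as.Date("2015-03-03"), as.Date("2015-05-12"), by = "7 days"), "%Y%m%d")
for (d in dates) {
  # 1. re-run the read_csv()/gather()/mutate() pipeline above with date = d instead of today
  # 2. rebuild the ggplot object `gg` exactly as in the script above
  # 3. save the frame: png(sprintf("%s.png", d), width=800, height=800); print(gg); dev.off()
}
# then, outside R, something like:  convert -delay 80 *.png drought.gif   (ImageMagick)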

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

Categories: Methodology Blogs

Recent Common Ancestors: Simple Model

Fri, 2015-05-15 07:35

(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

An interesting paper (Modelling the recent common ancestry of all living humans, Nature, 431, 562–566, 2004) by Rohde, Olson and Chang concludes with the words:

Further work is needed to determine the effect of this common ancestry on patterns of genetic variation in structured populations. But to the extent that ancestry is considered in genealogical rather than genetic terms, our findings suggest a remarkable proposition: no matter the languages we speak or the colour of our skin, we share ancestors who planted rice on the banks of the Yangtze, who first domesticated horses on the steppes of the Ukraine, who hunted giant sloths in the forests of North and South America, and who laboured to build the Great Pyramid of Khufu.

This paper considered two models of our genealogical heritage. Neither of the models was exhaustive and they could not take into account all of the complexities involved in determining the genealogy of the Earth's population. However, they captured many of the essential features. And despite their relative simplicity, the models reveal an unexpected result: a genealogical Common Ancestor (CA) of the entire present day human population would have lived in the relatively recent past.

It would be interesting to replicate some of those results.

An Even Simpler Model

In an earlier paper, Recent common ancestors of all present-day individuals, Chang considered a simpler model which he described as follows:

We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made just as in the standard Wright–Fisher model (randomly and equally likely over the n possibilities) the only difference being that here each individual chooses twice instead of once. All choices are made independently. Thus, for example, it is possible that when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.

At first sight this appears to be an extremely crude and abstract model. However it captures many of the essential features required for modelling genealogical inheritance. It also has the advantage of not being too difficult to implement in R. If these details are not your cup of tea, feel free to skip forward to the results.
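Before building the full machinery, here is a quick sanity check of the 1/n claim in the quote above (my own sketch, not part of the original code): simulate the two independent parent choices for many children and see how often both land on the same individual.

set.seed(13)
n <- 1000                                    # population size per generation
same.parent <- replicate(100000, {
  parents <- sample(n, 2, replace = TRUE)    # each child chooses two parents at random
  parents[1] == parents[2]                   # TRUE when both choices coincide
})
mean(same.parent)                            # should be close to 1/n = 0.001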

First we'll need a couple of helper functions to neatly generate labels for generations (G) and individuals (I).

> library(stringr)   # for str_pad()
>
> label.individuals <- function(N) {
+   paste("I", str_pad(1:N, 2, pad = "0"), sep = "")
+ }
>
> make.generation <- function(N, g) {
+   paste(paste0("G", str_pad(g, 2, pad = "0")), label.individuals(N), sep = "/")
+ }
>
> make.generation(3, 5)
[1] "G05/I01" "G05/I02" "G05/I03"

Each individual is identified by a label of the form "G05/I01", which gives the generation number and the specific identifier within that generation.

Next we'll have a function to generate the random links between individuals in successive generations. The function will take two arguments: the number of individuals per generation and the number of ancestor generations (those prior to the "current" generation).

> # N - number of people per generation
> # G - number of ancestor generations (there will be G+1 generations!)
> #
> generate.edges <- function(N, G) {
+   edges = lapply(0:(G-1), function(g) {
+     children <- rep(make.generation(N, g), each = 2)
+     #
+     parents <- sample(make.generation(N, g+1), 2 * N, replace = TRUE)
+     #
+     data.frame(parents = parents, children = children, stringsAsFactors = FALSE)
+   })
+   do.call(rbind, edges)
+ }
>
> head(generate.edges(3, 3), 10)
   parents children
1  G01/I02  G00/I01
2  G01/I01  G00/I01
3  G01/I01  G00/I02
4  G01/I02  G00/I02
5  G01/I01  G00/I03
6  G01/I03  G00/I03
7  G02/I02  G01/I01
8  G02/I01  G01/I01
9  G02/I01  G01/I02
10 G02/I02  G01/I02

So, for example, the data generated above links the child node G00/I01 (individual 1 in generation 0) to two parent nodes, G01/I02 and G01/I01 (individuals 1 and 2 in generation 1).

Finally we'll wrap all of that up in a graph-like object.

> library(igraph)
> generate.nodes <- function(N, G) {
+   lapply(0:G, function(g) make.generation(N, g))
+ }
>
> generate.graph <- function(N, G) {
+   nodes = generate.nodes(N, G)
+   #
+   net <- graph.data.frame(generate.edges(N, G), directed = TRUE, vertices = unlist(nodes))
+   #
+   # Edge layout
+   #
+   x = rep(1:N, G+1)
+   y = rep(0:G, each = N)
+   #
+   net$layout = cbind(x, y)
+   #
+   net$individuals = label.individuals(N)
+   net$generations = nodes
+   #
+   net
+ }

Let's give it a whirl on a graph consisting of 10 ancestor generations (11 generations including the current generation) and 8 individuals per generation. The result is plotted below with the oldest generation (G10) at the top and the current generation (G00) at the bottom. The edges indicate parent/progeny links.

Okay, so it looks like all of the infrastructure is in place. Now we need to put together the functionality for analysing the relationships between generations. We are really only interested in how any one of the ancestor generations relates to the current generation, which makes things a little simpler. First we write a function to calculate the number of steps from an arbitrary node (identified by label) to each of the nodes in the current generation. This is simplified by the fact that igraph already has a shortest.paths() function.

> generation.descendants <- function(net, node) {
+   present.generation <- first(net$generations)
+   #
+   shortest.paths(net, v = node, to = present.generation, mode = "out")
+ }

Next a pair of helper functions which indicate whether a particular node is a CA of all nodes in the current generation or if it has gone extinct. Node G03/I08 in the above graph is an example of an extinct node since it has no child nodes.

> common.ancestor <- function(net, node) {
+   all(is.finite(generation.descendants(net, node)))
+ }
>
> extinct <- function(net, node) {
+   all(is.infinite(generation.descendants(net, node)))
+ }

Let's test those. We'll generate another small graph for the test.

> set.seed(1)
> #
> net <- generate.graph(3, 5)
> #
> # This is an ancestor of all.
> #
> generation.descendants(net, "G05/I01")
        G00/I01 G00/I02 G00/I03
G05/I01       5       5       5
> common.ancestor(net, "G05/I01")
[1] TRUE
> #
> # This is an ancestor of all but G00/I02
> #
> generation.descendants(net, "G02/I03")
        G00/I01 G00/I02 G00/I03
G02/I03       2     Inf       2
> common.ancestor(net, "G02/I03")
[1] FALSE
> #
> # This node is extinct.
> #
> generation.descendants(net, "G03/I01")
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
> extinct(net, "G03/I01")
[1] TRUE

It would also be handy to have a function which gives the distance from every node in a particular generation to each of the nodes in the current generation.

> generation.distance <- function(net, g) {
+   current.generation <- net$generations[[g+1]]
+   #
+   do.call(rbind, lapply(current.generation, function(n) {generation.descendants(net, n)}))
+ }
> generation.distance(net, 3)
        G00/I01 G00/I02 G00/I03
G03/I01     Inf     Inf     Inf
G03/I02       3       3       3
G03/I03       3       3       3

So here we see that G03/I02 and G03/I03 are both ancestors of each of the individuals in the current generation, while G03/I01 has gone extinct.

Finally we can pull all of this together with a single function that classifies individuals in previous generations according to whether they are an ancestor of at least one but not all of the current generation; a common ancestor of all of the current generation; or extinct.

> classify.generation <- function(net, g) {
+   factor(apply(generation.distance(net, g), 1, function(p) {
+     ifelse(all(is.infinite(p)), 1, ifelse(all(is.finite(p)), 2, 0))
+   }), levels = 0:2, labels = c("ancestor", "extinct", "common"))
+ }
> classify.generation(net, 3)
G03/I01 G03/I02 G03/I03
extinct  common  common
Levels: ancestor extinct common

Results

First let's have a look at how the relationship between previous generations and the current generation changes over time. The plot below indicates the proportion of ancestors, common ancestors and extinct individuals as a function of generation number. Note that time runs backwards along the horizontal axis, with the current generation on the left and successive ancestor generations moving towards the right.

The graph used to generate these data had 1000 individuals per generation. We can see that for the first 10 generations around 80% of the individuals were ancestors of one or more (but not all!) of the current generation. The remaining 20% or so were extinct. Then in generation 11 we have the first CA. This individual would be labelled the Most Recent Common Ancestor (MRCA). Going further back in time the fraction of CAs increases while the proportion of mere ancestors declines until around generation 19 when all of the ancestors are CAs.

There are two important epochs to identify on the plot above:

  1. the generation at which the MRCA emerged (this is denoted T_n); and
  2. the generation at which all individuals are either CAs or extinct (this is denoted U_n).

Chang's paper gives asymptotic expressions for how these epochs vary as a function of generation size. It also presents a table with sample values of T_n and U_n for generations of size 500, 1000, 2000 and 4000. Those samples support the asymptotic relationships.

Since we have all of the infrastructure in place, we can conduct an independent test of the relationships. Below is a boxplot generated from 100 realisations each of calculations with generations of size 10, 100 and 1000. The blue boxes reflect our estimates for T_n, while the green boxes are the estimates of U_n. The two dashed lines reflect the asymptotic relationships. There's pretty good agreement between the theoretical and experimental results, although our estimates for U_n appear to be consistently larger than expected. The sample size was limited though... Furthermore, these are asymptotic relationships, so we really only expect them to hold exactly for large generation sizes.
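For reference, a single realisation of these epoch estimates can be obtained with the helper functions defined earlier. This is my own sketch (generation size and depth chosen small to keep the run time short), not the exact code used for the boxplot:

estimate.epochs <- function(N, G) {
  net <- generate.graph(N, G)
  classes <- lapply(1:G, function(g) classify.generation(net, g))
  # T_n: first ancestor generation containing a common ancestor
  Tn <- which(sapply(classes, function(x) any(x == "common")))[1]
  # U_n: first generation in which everyone is either a CA or extinct
  Un <- which(sapply(classes, function(x) all(x %in% c("common", "extinct"))))[1]
  c(Tn = Tn, Un = Un)
}
estimate.epochs(N = 50, G = 30)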

When would our MRCA have been alive based on this simple model and the Earth's current population? Setting the generation size to the Earth's current population of around 7 billion people gives T_n at about 33 generations and a U_n of approximately 58 generations. I know, those 7 billion people are not from the same generation, but we are going for an estimate here...

What does a generation translate to in terms of years? That figure has varied significantly over time. It also depends on location. However, in developed nations it is currently around 30 years. WolframAlpha equates a generation to 28 years. So, based on these estimates we have T_n at around 1000 years and our estimate of U_n translates into less than 2 millennia.
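The arithmetic behind those last two paragraphs, as a sketch: it assumes the asymptotic forms T_n ≈ log2(n) and U_n ≈ 1.77 log2(n) from Chang's results discussed above (the code itself is not from the post).

n <- 7e9                 # rough current world population
Tn <- log2(n)            # ~ 32.7 generations back to the MRCA
Un <- 1.77 * log2(n)     # ~ 57.9 generations until everyone is a CA or extinct
c(Tn, Un)

years <- 30              # assumed years per generation
c(Tn, Un) * years        # ~ 980 and ~ 1740 years, matching the estimates above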

Of course, these numbers need to be taken with a pinch of salt since they are based on a model which assumes that the population has been static at around 7 billion, which we know to be far from true! However these results are also not completely incompatible with the more sophisticated results of Rohde, Olson and Chang who found that the MRCA of all present-day humans lived just a few thousand years ago.

The post Recent Common Ancestors: Simple Model appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on his blog: Exegetic Analytics » R.

Categories: Methodology Blogs

Convert data.tree to and from list, json, networkD3, and more

Fri, 2015-05-15 02:29

(This article was first published on ipub » R, and kindly contributed to R-bloggers)

The data.tree package provides a facility to create and manage tree structures in R, and to decorate nodes with methods. This comes in handy for many algorithms. For an example, see here. Or read up the vignettes in the package, e.g. from CRAN.

In the last post, I showed how we can write a custom converter to convert a data.tree Node to a JSON string. For standard conversions, this has just become extremely simple. As of today, data.tree contains generic methods to convert a data.tree from and to a list.

Convert a Node to a list

Converting a data.tree Node to a list is really simple:

library(data.tree)
data(acme)
l <- acme$ToList()

Alternatively, you might prefer the “traditional” generic method:

as.list(acme)

Convert a Node to JSON

Conversion to JSON becomes trivial:

library(rjson)
cat(toJSON(l))

The result is (doing some pretty-printing):

{ "name": "Acme Inc.", "children": { "Accounting": { "name": "Accounting", "children": { "New Software": { "name": "New Software", "cost": 1000000, "p": 0.5 }, "New Accounting Standards": { "name": "New Accounting Standards", "cost": 500000, "p": 0.75 } } }, "Research": { "name": "Research", "children": { "New Product Line": { "name": "New Product Line", "cost": 2000000, "p": 0.25 }, "New Labs": { "name": "New Labs", "cost": 750000, "p": 0.9 } } }, "IT": { "name": "IT", "children": { "Outsource": { "name": "Outsource", "cost": 400000, "p": 0.2 }, "Go agile": { "name": "Go agile", "cost": 250000, "p": 0.05 }, "Switch to R": { "name": "Switch to R", "cost": 50000, "p": 1 } } } } }

If we prefer our children to be in an array, we can use the unname flag. This will cause the nested children elements to be unnamed in the list:

cat(toJSON(acme$ToList(unname = TRUE)))

This yields:

{ "name": "Acme Inc.", "children": [ { "name": "Accounting", "children": [ { "name": "New Software", "cost": 1000000, "p": 0.5 }, { "name": "New Accounting Standards", "cost": 500000, "p": 0.75 } ] }, { "name": "Research", "children": [ { "name": "New Product Line", "cost": 2000000, "p": 0.25 }, { "name": "New Labs", "cost": 750000, "p": 0.9 } ] }, { "name": "IT", "children": [ { "name": "Outsource", "cost": 400000, "p": 0.2 }, { "name": "Go agile", "cost": 250000, "p": 0.05 }, { "name": "Switch to R", "cost": 50000, "p": 1 } ] } ] }

There are multiple options in the ToList method. Type the following to learn more:

?ToList
?as.list.Node

Namely, you can use different keys in place of the reserved words “name” and “children” when importing and exporting, if you want.

Convert a list to a Node

As a side note: you can also convert a list to a Node:

as.Node(l)

##                        levelName
## 1  Acme Inc.
## 2   ¦--Accounting
## 3   ¦   ¦--New Software
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research
## 6   ¦   ¦--New Product Line
## 7   ¦   °--New Labs
## 8   °--IT
## 9       ¦--Outsource
## 10      ¦--Go agile
## 11      °--Switch to R

 

networkD3

This opens endless possibilities for integration with other packages. For example, to draw a tree with the networkD3 package:

library(networkD3)
treeNetwork(acme$ToList(unname = TRUE))

The result is here. Not very meaningful, I agree, but you get the idea.

Note that the ToList feature is not yet rolled out to CRAN, but you can easily try it out by downloading it from github:

devtools::install_github("gluc/data.tree")

Next, I’ll be working on converting a data.tree to a dendrogram for plotting, and to a party object (from the partykit package) for access to partitioning algorithms.

The post Convert data.tree to and from list, json, networkD3, and more appeared first on ipub.

To leave a comment for the author, please follow the link and comment on his blog: ipub » R.

Categories: Methodology Blogs

New in forecast 6.0

Fri, 2015-05-15 01:01

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

This week I uploaded a new version of the forecast package to CRAN. As there were a lot of changes, I decided to increase the version number to 6.0.

The changes are all outlined in the ChangeLog file as usual. I will highlight some of the more important changes since v5.0 here.

ETS

One of the most used functions in the package is ets() and it provides a stock forecasting engine for many organizations. The default model selection is now restricted to exclude multiplicative trend models as these often give very poor forecasts due to the extrapolation of exponential trends. Multiplicative trend models can still be fitted if required. I compared the new default settings with the old defaults on the M3 data, and found a considerable difference in forecast accuracy:

           MAPE  sMAPE  MASE
ETS       17.38  13.13  1.43
ETS (old) 18.04  13.36  1.52
AutoARIMA 19.12  13.85  1.47

Here “ETS” denotes the new default approach (without multiplicative trends) and “ETS (old)” is the old default approach including possibly multiplicative trends. For comparison, the results from applying auto.arima to the same data are also shown.
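If you do want the old behaviour back for a particular series, multiplicative trends can be switched on explicitly. A minimal sketch (the allow.multiplicative.trend argument is my assumption of how this is exposed; check ?ets):

library(forecast)
fit <- ets(AirPassengers, allow.multiplicative.trend = TRUE)  # opt back in to multiplicative trends
summary(fit)
plot(forecast(fit, h = 24))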

ARIMA

The auto.arima() function is now stricter on near unit-roots. Even if a model can be estimated, it will not be selected if the characteristic AR or MA roots are too close to the unit circle. This prevents occasional numerical instabilities occurring. Previously the roots had to be at least 0.001 away from the unit circle. Now they have to be at least 0.01 from the unit circle.

There is a new allowmean argument in auto.arima which can be used to prevent a mean term being included in a model.

There is a new plot.Arima() function which plots the characteristic roots of an ARIMA model. This is based on a blog post I wrote last year.

It is now possible to easily obtain the fitted model order for use in other functions. The function arimaorder applied to a fitted ARIMA model (such as that returned by auto.arima) will return a numeric vector of the form (p,d,q) for a nonseasonal model and (p,d,q,P,D,Q,m) for a seasonal model. Similarly, as.character applied to the object returned by Arima or auto.arima will give a character string with the fitted model, suitable for use in plotting or reports.
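A short sketch pulling together the ARIMA additions described above (the function and argument names are those given in the text; WWWusage is just a built-in example series):

library(forecast)
fit <- auto.arima(WWWusage, allowmean = FALSE)  # new allowmean argument
arimaorder(fit)      # numeric vector (p,d,q) of the fitted model
as.character(fit)    # e.g. "ARIMA(1,1,1)", handy for plot titles and reports
plot(fit)            # plot.Arima(): characteristic roots of the fitted model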

TBATS/BATS

The models returned by tbats and bats were occasionally unstable. This problem has been fixed, again by restricting the roots to be further away from the unit circle.

STL

stlf and forecast.stl combine forecasting with seasonal decomposition. The seasonally adjusted series are forecast, and then the forecasts are re-seasonalized. These functions now have a forecastfunction argument to allow user-specified methods to be used in the forecasting step.

There is a new stlm function and a corresponding forecast.stlm function to allow the model estimation to be separated from the forecasting, thus matching most other forecasting methods in the package. This allows more flexible specification of the model to be used for the seasonally adjusted series.
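A sketch of the new STL workflow (the modelfunction argument and the forecastfunction signature are assumptions based on the help pages, not code from the post; USAccDeaths is just a built-in monthly series):

library(forecast)

# Fit once, forecast separately
fit <- stlm(USAccDeaths, modelfunction = ar)   # model the seasonally adjusted series with ar()
plot(forecast(fit, h = 36))

# Or hand stlf() a user-specified forecasting method directly
fc <- stlf(USAccDeaths, h = 36,
           forecastfunction = function(x, h, level, ...) forecast(ets(x), h = h, level = level))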

ACF/PACF

The Acf function replaces the acf function to provide better plots of the autocorrelation function. The horizontal axis now highlights the seasonal lags.

I have added two new functions taperedacf and taperedpacf to implement the estimates and plots proposed in this recent paper.

Seasonality

The fourier() and fourierf() functions produce a matrix of Fourier terms for use in regression models for seasonal time series. These were updated to work with msts objects so that multiple seasonalities can be fitted.

Occasionally, the period of the seasonality may not be known. The findfrequency() function will estimate it. This is based on an earlier version I wrote for this blog post.
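A quick sketch of both functions (the vector form of K for msts objects is my reading of the updated interface; taylor is the half-hourly demand series shipped with the package):

library(forecast)

x <- msts(taylor, seasonal.periods = c(48, 336))   # daily and weekly seasonality
head(fourier(x, K = c(3, 3)))                      # Fourier terms for both seasonal periods

findfrequency(USAccDeaths)                         # estimates the period (here 12)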

Automatic forecasting

The forecast.ts() function takes a time series and returns some forecasts, without the user necessarily knowing what is going on under the hood. It will use ets with default settings (if the data are non-seasonal or the seasonal period is 12 or less) and stlf (if the seasonal period is 13 or more). If the seasonal period is unknown, there is an option (find.frequency=TRUE) to estimate it first.
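For example (a minimal sketch; find.frequency is the option named above, and the data sets are my own choices):

library(forecast)

forecast(USAccDeaths)              # period 12 is known, so ets() with default settings is used

y <- ts(as.numeric(USAccDeaths))   # the same data with the period stripped
forecast(y, find.frequency = TRUE) # estimate the period first, then forecast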

 

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.

Categories: Methodology Blogs

Getting started with MongoDB in R

Thu, 2015-05-14 20:00

(This article was first published on OpenCPU, and kindly contributed to R-bloggers)

The first stable version of the new mongolite package has appeared on CRAN. Mongolite builds on jsonlite to provide a simple, high-performance MongoDB client for R, which makes storing and accessing small or large data as easy as converting it to/from JSON. The package vignette has some examples to get you started with inserting, json queries, aggregation and map-reduce. MongoDB itself is open source and installation is easy (e.g. brew install mongodb).
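To give a flavour, here is a minimal sketch (not from the announcement; the database and collection names are made up, a local mongod on the default port is assumed, and the calls should be double-checked against the package vignette):

library(mongolite)

m <- mongo(collection = "diamonds", db = "test")   # connects to mongodb://localhost by default
m$insert(ggplot2::diamonds)                        # store a data frame as JSON documents
m$count()
m$find('{"cut" : "Premium", "price" : {"$lt" : 1000}}', limit = 5)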

If you use, or think you might want to use, MongoDB with R, please get in touch. I am interested to hear about your problems and use cases to make this package fit everyone's needs. I will also be presenting this and related work at UseR 2015 and the annual French R Meeting.

To leave a comment for the author, please follow the link and comment on his blog: OpenCPU.

Categories: Methodology Blogs