R bloggers

R news and tutorials contributed by (563) R bloggers

Romain Francois at MilanoR meeting

Mon, 2014-12-01 09:25

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Romain Francois is a 32-year-old R developer and consultant. He describes himself as a "Professional R Enthusiast", and r-enthusiasts.com is his website. He co-authored several R packages, such as dplyr and Rcpp.

Romain Francois writes a world-famous blog about R and the R Graph Gallery, which showcases hundreds of examples of data visualization with R.

I am honored to announce that Romain Francois is the special guest of the next MilanoR Meeting.

To leave a comment for the author, please follow the link and comment on his blog: MilanoR.

Interview with SERATRON – Lego EV3 robot driven by R

Mon, 2014-12-01 00:30

(This article was first published on SmarterPoland » PISA in English, and kindly contributed to R-bloggers)

The next meeting of Warsaw R Enthusiasts (SER = Spotkania Entuzjastów R) will take place on December 8. We are going to start with Roger Bivand's talk about spatial statistics (R Foundation / NHH, author of many R packages). The second talk, by Bartosz Meglicki (IPI PAN), will introduce the SERATRON – a fusion of R and Lego Mindstorms.

Below we publish an exclusive SER interview with SERATRON (we are sorry but SERATRON is a naughty one):

SER: Hi, could you introduce yourself?
SERATRON: I am a robot coded in pure R with ev3dev.R bindings. The bindings run on ev3dev linux distribution. The operating system runs on Lego EV3 hardware. We are all under heavy development.

SER: Tell us more about the R part.
SERATRON: I am running RServe server. PCs can communicate with me using RSclient. They do all the heavy computations for me, I can’t be bothered (I have no FPU, can you imagine?!?). Those PCs think they can command me…
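
For readers unfamiliar with that setup, a minimal sketch of the PC side is shown below; it is my own illustration, the host name is a placeholder, and the ev3dev.R calls that would actually drive the robot are not shown.

library(RSclient)

# connect to the Rserve instance running on the robot
# (host name and port are placeholders)
con <- RS.connect(host = "seratron.local", port = 6311)

# evaluate an expression in the robot's R session; a real session would call
# the ev3dev.R bindings here to read sensors or drive the motors
RS.eval(con, R.version.string)

RS.close(con)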

SER: Why do you have two servo motors?
SERATRON: So that I can move, you silly!

SER: Hmm, why do you have an infrared sensor?
SERATRON: So that I can avoid obstacles and I will not fall from height. Why do you have eyes donkey?

SER: Errrrrr, what are your touch sensors for?
SERATRON: They also prevent me from falling from height. If I don’t feel the surface below me I am not going to move. Would you?

SER: Ok, last question you naughty robot! What is your gyroscope and accelerometer for?
SERATRON: Can’t you count? It’s not one question but two. I use gyro to know my heading direction. No dumbass, not for balancing, I don’t use it in this axis. I don’t use accelerometer now but I could use tilt input as a third level protection from falling. I could also use it to detect collisions with other objects but I have other means.

SER: It was a pleasure to meet you (coughs), thank you.
SERATRON: I hope I will be finished for SER VII – 08.12.2014!

To leave a comment for the author, please follow the link and comment on his blog: SmarterPoland » PISA in English.

Turn R (Shiny) Scripts Into Double-clickable OS X Applications With One Line of Code

Sun, 2014-11-30 23:00

(This article was first published on Data Driven Security, and kindly contributed to R-bloggers)

I was playing with some non-security-oriented R+Shiny code the other day, and thought that Shiny apps would be even more useful if they were double-clickable applications that you could "just run"—provided R was installed on the target system—versus having to cut/paste code into R. Now, I know it's not hard to enter:

shiny::runGist('95ec24c1b0cb433a76a5', launch.browser=TRUE)

at an R console, but I'll wager many developers have users that would still appreciate a double-clickable icon. Since I'm running OS X Yosemite on some of my development machines, I thought this would be a good reason to try out Apple's new JavaScript for Automation (JXA), since I am loath to work in regular AppleScript.

If you fire up Apple’s Script Editor, you can choose which language your script is in from the popup on the script window:

With JXA, all you need is one line of code to run a Shiny gist in R:

Application("R").cmd("shiny::runGist('95ec24c1b0cb433a76a5', launch.browser=TRUE)")

If you like/prefer “pure” AppleScript or are on an older version of OS X you still only need four lines of code:

tell application "R"
    activate
    cmd "shiny::runGist('95ec24c1b0cb433a76a5', launch.browser=TRUE)"
end tell

Save the script as an application and your users will be greeted with your Shiny app in their default browser.

Caveat Scripter

When an application is created this way, it quits immediately after launching R.app and then the R.app window is left open (showing all the R console output from the script). I personally think this is A Very Good Thing, but some folks may not, so you can miniaturize it upon startup via:

R = Application("R")
R.windows[0].miniaturized = true
R.cmd("shiny::runGist('95ec24c1b0cb433a76a5', launch.browser=TRUE)")

or

tell application "R"
    activate
    set miniaturized of window 1 to true
    cmd "shiny::runGist('95ec24c1b0cb433a76a5', launch.browser=TRUE)"
end tell

Users will still need to quit out of R, but you could also add a Shiny actionButton to your app that does a quit(save="no", runLast=FALSE) on submit (example code in this gist) to make it feel like a “real” application.
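
A minimal sketch of such a button is shown below; it is my own illustration rather than the code from the gist, and the input id quit_app is an arbitrary placeholder.

library(shiny)

ui <- fluidPage(
  actionButton("quit_app", "Quit")
)

server <- function(input, output, session) {
  # stop the whole R process when the button is pressed,
  # so closing the app also closes R.app
  observeEvent(input$quit_app, {
    quit(save = "no", runLast = FALSE)
  })
}

shinyApp(ui, server)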

This all assumes & relies on the fact that the shiny package is already installed on your systems. To ensure any application-dependent packages are installed without forcing the users to manually install them, you can use something like this:

pkg <- c("shiny", "zipcode", "pbapply", "data.table", "dplyr", "ggplot2",
         "grid", "gridExtra", "stringi", "magrittr")
new.pkg <- pkg[!(pkg %in% installed.packages())]
if (length(new.pkg)) {
  install.packages(new.pkg)
}

at the top of server.R to ensure they are available to your application (note that said hack assumes a CRAN mirror is set).

Finally, for maximum compatibility, you’ll need to use the pure AppleScript version instead of the JXA version unless all your users are on Yosemite or higher.

Example Shiny Snowfall App

If you're on OS X and have R and the shiny package installed, you can try out the sample "Shiny Snowfall" app by downloading and unzipping this file (you may need to right/option-click->Save As):

and then running the "Shiny Snowfall" app. (NOTE: You need to have your Security & Privacy settings set to "Allow apps downloaded from: Mac App Store and identified developers" to run the application, or option/right-click and choose "Open" on the app icon.)

The icon used is from Adam Whitcroft’s (@adamwhitcroft) Climacons collection.

To leave a comment for the author, please follow the link and comment on his blog: Data Driven Security.

Book Review: R Object-oriented Programming

Sun, 2014-11-30 20:30

(This article was first published on Data Analysis and Visualization in R, and kindly contributed to R-bloggers)


Recently, Packt Publishing published the book R Object-oriented Programming. The eleven-chapter book covers topics from basic data types in R to more advanced methods such as simulation and writing functions. Different data types (i.e. integer, character, factor) are discussed in depth in the book, as well as numeric and string operations. The good thing about it is that it walks the reader from simple to complicated topics, making it easier to learn the language. There are two chapters dedicated to discussing the S3 and S4 classes, which are not commonly covered in published books about R. The last two chapters of the book are dedicated case studies wherein the reader can apply all the learning from the previous chapters. A very good book for beginners and basic users who want to advance their skills and explore the full potential of R as a programming language. Good book to have!

To leave a comment for the author, please follow the link and comment on his blog: Data Analysis and Visualization in R.

Change in temperature in Netherlands over the last century

Sun, 2014-11-30 03:20

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I read a post 'race for the warmest year' at sargasso.nl. They used a plot, originating from Ed Hawkins to see how 2014 progressed to be warmest year. Obviously I wanted to make the same plot using R. In addition, I wondered which parts of the year had most increased in temperature.
Data

Similar to last week, data are acquired from KNMI. They have various sets of data; this page has a selection form which leads to the data used today. The data come with a header explaining details, unfortunately in Dutch. Of relevance for this post is TG, the daily mean temperature in 0.1 °C. The station used is De Bilt, which has the most data. Prior to reading the data into R, the explanatory text header was removed.
The data input is completed by converting YYYYMMDD to date, year, month and dayno variables. Prior to analysis, for simplicity, the leap days were removed. I chose merge() rather than a dplyr merge function since the latter releveled my month factor. The data.frame mylegend is used later on to label the first of each month.
library(plyr)
library(dplyr)
library(ggplot2)
library(locfit)
library(plotrix)
r1 <- read.csv('KNMI_20141115.edited.txt')
Sys.setlocale(category = "LC_TIME", locale = "C")
r2 <- mutate(r1,
    date = as.Date(format(YYYYMMDD),'%Y%m%d'),
    month =factor(months(date,abbreviate=TRUE),
        levels=months(as.Date(
                paste('2014',
                    formatC(1:12,digits=2,width=2,flag='0'),
                    '01',sep='-')),
            abbreviate=TRUE)),
    yearf=factor(format(date,'%Y')),
    yearn=as.numeric(substr(YYYYMMDD,1,4)),
    day=format(date,'%e'))

days <- filter(r2,yearn==1901) %>%
    mutate(.,dayno=format(date,'%j') ) %>%
    select(.,month,day,dayno)

r3 <- merge(r2,days,all=TRUE) %>%
    filter(.,!grepl('0229',YYYYMMDD)) %>%
    mutate(.,daynon=as.numeric(dayno))

mylegend <- filter(days,day==' 1') %>%
    mutate(.,daynon=as.numeric(dayno))
Plots

Cumulative average by year

Each line is a separate year. For this plot, year needs to be a factor. Unfortunately, I was not able to get a colourbar legend for the colors, since that requires a continuous year variable. Green is the beginning of last century, pinkish is recent, and the fat black line is 2014.

r4 <- group_by(r3,yearf) %>%
    mutate(.,cmtemp = cummean(TG/10))

g1 <- ggplot(r4,aes(x=daynon,y=cmtemp,
        col=yearf))
g1 + geom_line(alpha=.4,show_guide=FALSE)  +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    scale_y_continuous('Temperature (C)')  +
    geom_line(data=r4[r4$yearf=='2014',],
        aes(x=daynon,y=cmtemp),
        col='black',
        size=2)
2014 with average of 30 years

To get a better idea how 2014 compares to previous years, the average of 30 years has been added. We had a warm year, except for August, and one which suggested an early spring. In hindsight, the second half of August had colder days than the beginning of April or the end of October.


r3$Period <- cut(r3$yearn,c(seq(1900,2013,30),2013,2014),
    labels=c('1901-1930','1931-1960',
        '1961-1990','1991-2013','2014'))
g1 <- ggplot(r3[r3$yearn<2014,],aes(x=daynon,y=TG/10,col=Period))
g1 + geom_smooth(span=.15,method='loess',size=1.5)  +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    geom_line(#aes(x=daynon,y=TG/10),
        data=r3[r3$yearn==2014,]) +
    scale_y_continuous('Temperature (C)')
Change by year

Finally, a plot showing how temperature changed within the years. To obtain this plot, I needed a day-corrected base temperature. The baseline temperature is smoothed over days for the years 1901 to 1924. The baseline was used to get a corrected temperature, which was subsequently smoothed over years and days.
Smoothers have edge effects, to remove these from the visual part, January and December have been added as extra to the data. Hence within the year there are only minimal edge effects.
The plot shows that middle last century, some parts of the year actually had a drop in temperature. In contrast, November has gradually been getting warmer since middle last century. The new century has seen quite an increase. 
myyears <- r3[r3$yearn<1925,]
m13 <- filter(myyears,daynon<30) %>%
    mutate(.,daynon=daynon+365)
m0 <- filter(myyears,daynon>335) %>%
    mutate(.,daynon=daynon-365)
myyears <- rbind_list(m0,myyears,m13)

nn <- .2
mymod <- locfit(TG ~ lp(daynon,nn=nn),
    data=myyears)
topred <- data.frame(daynon=1:365)
topred$pp <- predict(mymod,topred)
#plot(pp~ daynon,data=topred)

r5 <- merge(r3,topred) %>%
    mutate(.,tdiff=(TG-pp)/10) %>%
    select(.,tdiff,daynon,yearn)
m13 <- filter(r5,daynon<30) %>%
    mutate(.,daynon=daynon+365,
        yearn=yearn-1)
m0 <- filter(r5,daynon>335) %>%
    mutate(.,daynon=daynon-365,
        yearn=yearn+1)
r6 <- rbind_list(m0,r5,m13)

topred <- expand.grid(
    daynon=seq(1:365),
    yearn=1901:2014)
topred$pp2 <- locfit(
        tdiff ~ lp(yearn,daynon,nn=nn),
        data=r6) %>%
    predict(.,topred)
#topred <- arrange(topred,daynon,yearn)

myz <- matrix(topred$pp2,ncol=365)
zmin <- floor(min(topred$pp2)*10)/10
zmax <- ceiling(max(topred$pp2)*10)/10
myseq <- seq(zmin,zmax,.1)
par(mar=c(5,4,4,6))
image(myz,useRaster=TRUE,
    axes=FALSE,frame.plot=TRUE,
    col=colorRampPalette(c('blue','red'))(length(myseq)-1),
    breaks=myseq)
axis((seq(10,114,by=10)-1)/113,labels=seq(1910,2010,by=10),side=1)
axis((mylegend$daynon-1)/365,labels=mylegend$month,side=2)
color.legend(1.1,0,1.2,1,legend=c(zmin,zmax),gradient='y',
    rect.col=colorRampPalette(c('blue','red'))(length(myseq)-1))

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

Storing Forecasts in a Database

Sat, 2014-11-29 22:08

(This article was first published on Quintuitive » R, and kindly contributed to R-bloggers)

In my last post I mentioned that I started using RSQLite to store computed results. No rocket science here, but my feeling is that this might be useful to others, hence, this post. This can be done using any database, but I will use (R)SQLite as an illustration.

Let's assume we are running a long ARMA/GARCH simulation (see here for an example) on daily data for the S&P 500. For each day, we fit numerous models using our package of choice. We need to store all useful information in a database and use it later for various analyses. A reasonable approach is to create two tables: one to store the models, and one to store the forecasts. Here is the relevant RSQLite code:

require(RSQLite)

driver = dbDriver("SQLite")
connection = dbConnect(driver, dbname=db.path)

query = paste(
    " create table if not exists models ( ",
    " history integer not null, ",
    " p integer not null, ",
    " q integer not null, ",
    " r integer not null, ",
    " s integer not null, ",
    " dist varchar(32) not null)",
    sep="")
dbGetQuery(connection, query)

query = paste(
    " create unique index if not exists models_unique ",
    " on models(history,p,q,r,s,dist) ",
    sep="")
dbGetQuery(connection, query)

query = paste(
    " create table if not exists forecasts ( ",
    " model integer not null, ",
    " date datetime not null, ",
    " mu real, ",
    " sigma real, ",
    " ic real) ",
    sep="")
dbGetQuery(connection, query)

query = paste(
    " create unique index if not exists forecasts_unique ",
    " on forecasts(model,date) ",
    sep="")
dbGetQuery(connection, query)

dbDisconnect(connection)

The models table has one row for each model, which is a unique combination of (history,p,q,r,s,distribution). Each forecast refers to its model (this way we avoid repetitions – a standard database normalization) and specifies its date. The other columns of the forecasts can be pretty much anything that’s needed. In the example, we save the forecasts for the mean and variance and the information criteria (AIC for instance). The update code is a bit more complicated (or at least in my implementation):

# The assumption is that we are in a context (function) containing the
# parameters specifying the model, i.e. they are available via the
# history, p, q, r, s and dist variables. We also assume that the forecast(s)
# are contained in fore.df.
driver = DBI::dbDriver("SQLite")
connection = DBI::dbConnect(driver, dbname=db.path)

# First add the model if it doesn't exist already
model.df = data.frame(history, p, q, r, s, dist)
colnames(model.df) = c("history", "p", "q", "r", "s", "dist")
query = paste(
    " insert or ignore into models (history, p, q, r, s, dist) ",
    " values (@history, @p, @q, @r, @s, @dist) ",
    sep="")
RSQLite::dbGetPreparedQuery(connection, query, bind.data=model.df)

# Get the model id from the rowid column (added automatically by SQLite)
query = paste("select rowid from models where history=? and p=? and q=? and r=? and s=? and dist=?", sep="")
rowid = as.numeric(RSQLite::dbGetPreparedQuery(connection, query, bind.data=model.df))

# Set the model id in the forecast data frame
fore.df[,1] = rowid
colnames(fore.df) = c("model", "date", "mu", "sigma", "ic")

DBI::dbBegin(connection)

# Insert the forecasts
query = paste(
    " insert or replace into forecasts ( ",
    " model, date, mu, sigma, ic) ",
    " values (@model, @date, @mu, @sigma, @ic) ",
    sep="")
RSQLite::dbGetPreparedQuery(connection, query, bind.data=fore.df)

DBI::dbCommit(connection)
DBI::dbDisconnect(connection)

In summary, insert the model if not already available, get the unique id for the model (using SQLite’s rowid unique column added automatically to each table), add the model id to the forecasts data frame and finally insert the forecasts data frame into the forecasts table.
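
As a usage note, getting the data back out is then a single join on the rowid key. The sketch below is my own illustration (not from the original post) and assumes the db.path variable defined earlier.

library(RSQLite)

connection <- dbConnect(SQLite(), dbname = db.path)

# join each forecast with the parameters of the model that produced it
res <- dbGetQuery(connection, paste(
    " select m.history, m.p, m.q, m.r, m.s, m.dist, ",
    "        f.date, f.mu, f.sigma, f.ic ",
    "   from forecasts f join models m on f.model = m.rowid ",
    "  order by f.date ", sep=""))

dbDisconnect(connection)
head(res)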

I have been using this approach for a while and have been quite happy with its usefulness.

Last but not least, if you are performing the simulation in parallel, the update must be wrapped in a critical section (for more details, check my previous post and the flock package).

To leave a comment for the author, please follow the link and comment on his blog: Quintuitive » R.

RcppArmadillo 0.4.550.1.0

Sat, 2014-11-29 06:46

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A week ago, Conrad provided another minor release 4.550.0 of Armadillo which has since received one minor correction in 4.550.1.0. As before, I had created a GitHub-only pre-release of his pre-release which was tested against the almost one hundred CRAN dependents of our RcppArmadillo package. This passed fine as usual, and results are as always in the rcpp-logs repository.

Processing and acceptance at CRAN took a little longer, as around the same time a fresh failure in unit tests had become apparent on an as-of-yet unannounced new architecture (!!) also tested at CRAN. R-devel has since gotten a new capabilities() test for long double, and we now only run this test (for our rmultinom()) if it asserts that the given R build has this capability. Phew; with all that, the new version is now on CRAN, Windows binaries have been built, and I also uploaded new Debian binaries.

Changes are summarized below; our end also includes added support for conversion of Field types, thanks to a short pull request by Romain.

Changes in RcppArmadillo version 0.4.550.1.0 (2014-11-26)
  • Upgraded to Armadillo release Version 4.550.1 ("Singapore Sling Deluxe")

    • added matrix exponential function: expmat()

    • faster .log_p() and .avg_log_p() functions in the gmm_diag class when compiling with OpenMP enabled

    • faster handling of in-place addition/subtraction of expressions with an outer product

    • applied correction to gmm_diag relative to the 4.550 release

  • The Armadillo Field type is now converted in as<> conversions

Courtesy of CRANberries, there is also a diffstat report for the most recent release. As always, more detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

CRAN Task Views for Finance and HPC now (also) on GitHub

Fri, 2014-11-28 21:14

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The CRAN Task View system is a fine project which Achim Zeileis initiated almost a decade ago. It is described in a short R Journal article in Volume 5, Number 1. I have been editor / maintainer of the Finance task view essentially since the very beginning of these CRAN Task Views, and added the High-Performance Computing one in the fall of 2008. Many, many people have helped by sending suggestions or even patches; email continues to be the main venue for the changes.

The maintainers of the Web Technologies task view were, at least as far as I know, the first to make the jump to maintaining the task view on GitHub. Karthik and I briefly talked about this when he was in town a few weeks ago for our joint Software Carpentry workshop at Northwestern.

So the topic had been on my mind, but it was only today that I realized that the near-limitless amount of awesome that is pandoc can probably help with maintenance. The task view code by Achim neatly converts the very regular, very XML, very boring original format into somewhat-CRAN-website-specific html. Pandoc, being as versatile as it is, can then make (GitHub-flavoured) markdown out of this, and with a minimal amount of sed magic, we get what we need.

And hence we now have these two new repos:

Contributions are now most welcome by pull request. You can run the included converter script; it differs between the two repos only by one constant for the task view / file name. As an illustration, the one for Finance is below.

#!/usr/bin/r
## if you do not have /usr/bin/r from littler, just use Rscript

ctv <- "Finance"

ctvfile  <- paste0(ctv, ".ctv")
htmlfile <- paste0(ctv, ".html")
mdfile   <- "README.md"

## load packages
suppressMessages(library(XML))          # called by ctv
suppressMessages(library(ctv))

r <- getOption("repos")                 # set CRAN mirror
r["CRAN"] <- "http://cran.rstudio.com"
options(repos=r)

check_ctv_packages(ctvfile)             # run the check

## create html file from ctv file
ctv2html(read.ctv(ctvfile), htmlfile)

### these look atrocious, but are pretty straight forward. read them one by one
###  - start from the htmlfile
cmd <- paste0("cat ", htmlfile,
###  - in lines of the form  ^<a href="Word">Word.html</a>
###  - capture the 'Word' and insert it into a larger URL containing an
###    absolute reference to task view 'Word'
    " | sed -e 's|^<a href=\"\\([a-zA-Z]*\\)\\.html|<a href=\"http://cran.rstudio.com/web/views/\\1.html\"|' | ",
###  - call pandoc, specifying html as input and github-flavoured markdown as output
    "pandoc -s -r html -w markdown_github | ",
###  - deal with the header by removing extra ||, replacing |** with ** and **| with **:
    "sed -e's/||//g' -e's/|\\*\\*/\\*\\*/g' -e's/\\*\\*|/\\*\\* /g' -e's/|$/ /g' ",
###  - make the implicit URL to packages explicit
    "-e's|../packages/|http://cran.rstudio.com/web/packages/|g' ",
###  - write out mdfile
    "> ", mdfile)
system(cmd)                             # run the conversion

unlink(htmlfile)                        # remove temporary html file

cat("Done.\n")

I am quite pleased with this setup---so a quick thanks towards the maintainers of the Web Technologies task view; of course to Achim for creating CRAN Task Views in the first place, and maintaining them all those years; as always to John MacFarlane for the magic that is pandoc; and last but not least of course to anybody who has contributed to the CRAN Task Views.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

Some R Highlights from H2O World

Fri, 2014-11-28 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

H2O.ai held its first H2O World conference over two days at the Computer History Museum in Mountain View, CA. Although the main purpose of the conference was to promote the company's rich set of Java-based machine learning algorithms and announce their new products Flow and Play, there were quite a few sessions devoted to R and statistics in general.

Before I describe some of these, a few words about the conference itself. H2O World was exceptionally well run, especially for a first try with over 500 people attending (my estimate). The venue is an interesting, accommodating space with plenty of parking that played well with what, I think, must have been an underlying theme of the conference: acknowledging contributions of past generations of computer scientists and statisticians. There were two stages offering simultaneous talks for at least part of the conference: the Paul Erdős stage and the John Tukey stage. Tukey I got, but why put such an eccentric mathematician as Erdős front and center? I was puzzled until Sri Ambati, H2O.ai's CEO and co-founder, remarked that he admired Erdős because of his great generosity with collaboration. To a greater extent than most similar events, H2O World itself felt like a collaboration. There was plenty of opportunity to interact with other attendees, speakers and H2O technical staff (the whole company must have been there). Data scientists, developers and marketing staff were accessible and gracious with their time. Well done!

R was center stage for a good bit of the hands-on training that occupied the first day of the conference. There were several sessions (Exploratory Data Analysis, Regression, Deep Learning, Clustering and Dimensionality Reduction) on accessing various H2O algorithms through the h2o R package and the H2O API. All of these moved quickly from R to running the custom H2O algorithms on the JVM. However, the message that came through is that R is the right environment for sophisticated machine learning.

Two great pleasures from the second day of the conference were Trevor Hastie's tutorial on the Gradient Boosting Machine and John Chambers' personal remembrances of John Tukey. It is unusual for a speaker to announce that he has been asked to condense a two-hour talk into something just under an hour and then go on to speak slowly and with great clarity, each sentence beguiling you into imagining that you are really following the details. (It would be very nice if the video of this talk were made available.)

Two notable points from Trevor's lecture were understanding gradient boosting as minimizing the exponential loss function and the openness of the gbm algorithm to "tinkering". For the former point see Chapter 10 of the Elements of Statistical Learning or the more extended discussion in Schapire and Freund's Boosting: Foundations and Algorithms.

John Tukey spent 40 years at Bell Labs (1945 - 1985) and John Chambers' tenure there overlapped the last 20 years of Tukey's stay. Chambers, who had the opportunity to observe Tukey over this extended period of time, painted a moving and lifelike portrait of the man. According to Chambers, Tukey could be patient and gracious with customers and staff, provocative with his statistician colleagues and "intellectually intimidating". John remembered Richard Hamming saying: "John (Tukey) was a genius. I was not." Tukey apparently delighted in making up new terms when talking with fellow statisticians. For example, he called the top and bottom lines that identify the interquartile range on a box plot "hinges", not quartiles. I found it particularly interesting that Tukey would describe a statistic in terms of the process used to compute it, and not in terms of any underlying theory. Very unusual, I would think, for someone who earned a PhD in topology under Solomon Lefschetz. For more memories of John Tukey, including more from John Chambers, look here.

Other R-related highlights were talks by Matt Dowle and Erin LeDell. Matt reprised the update on new features in data.table that he recently gave to the Bay Area useR Group, and also presented interesting applications of data.table from the UK insurance company Landmark and from KatRisk (look here for the KatRisk part of Matt's presentation).

Erin, author of the h2oEnsemble package available on GitHub, delivered an exciting and informative talk on using ensembles of learners (combining gbm models and logistic regression models, for example) to create "superlearners".

Finally, I gave a short talk on Revolution Analytics' recent work towards achieving reproducibility in R. The presentation motivates the need for reproducibility by examining the use of R in industry and science, and describes how the checkpoint package and Revolution R Open, an open source distribution of R that points to a static repository, can be helpful.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R Object-oriented Programming – Book Review

Fri, 2014-11-28 05:07

(This article was first published on R Video tutorial for Spatial Statistics, and kindly contributed to R-bloggers)



I have been asked to review the book “R Object-oriented Programming” by Kelly Black, edited by Packt publishing (£14.45 for the E-Book, £27.99 for Print+E-Book).
The scope of the book is "to provide a resource for programming using the R language" and therefore it can be seen as a good and practical introduction to the most commonly used parts of R. The first two chapters deal with data types and data organization in R. They quickly review how to handle each type of data (such as integers and doubles) and how to organize them into R objects. The third chapter deals with reading data from files and saving them. This chapter gives a pretty good introduction to reading and writing every sort of data, even binaries, and from a variety of sources, including the web. Chapter 4 provides an introduction to the R commands for generating random numbers; in particular it gives a thorough overview of the sample command. Chapters 5 and 6 give a good background in the use of R to manipulate string and time variables.

Of particular interest throughout the book is the handling of data gathered from public sources on the web. For these particular data, skills in string manipulation become crucial both for handling web addresses and for extracting the actual data from the information returned by the server. For this reason I think this book does a good job of introducing these important aspects of the R language.

Chapter 7 introduces some basic programming concepts, such as if statements and loops. Chapters 8 and 9 provide a complete overview of the S3 and S4 classes, and finally chapters 10 and 11 are two hands-on examples of how to put together all the concepts learned in the book to solve very practical problems. In these examples the reader will be guided towards the creation of powerful R programs to grade students and to perform a Monte Carlo simulation.

The book is written in a very practical form, meaning that not much time is spent explaining each function in detail; readers can browse the help pages of each function for more details. This means that this book is probably not for newbies to programming languages. Most of the learning is done by exploring the lines of code provided, and for this reason I think the best readers would be people familiar with a programming language, even though I do not think that readers necessarily need some familiarity with R. However, as stated on the website, the target audience for this book is beginners who want to become more "fluent" with the language.

Overall, I think this book does a good job of providing the reader with a strong and neat introduction to all the bits of coding required to become more comfortable writing advanced scripts. For example, at the end of chapter 2 the author discusses the use of the apply set of commands. These are crucial milestones to be learned by every individual who wants to switch from a mundane use of R to a more advanced and rigorous use of the language. In my personal experience, when I began using R I would often create very long scripts using lots of loops and if statements, which tend to greatly decrease the execution speed. As soon as I learned to master the apply set of commands I was able to reduce my code, and crucially I was also able to substantially increase its execution speed. Personally, I would have loved to have access to such a book back then! The use of web sources for data manipulation is also a very nice addition that, as far as I know, is not common in other introductory texts.

Nowadays gathering data from the web has become the norm, and therefore I think it is important to provide beginners with tools to handle these types of data.

The strength of this book, however, is in chapters 8 and 9, which provide an extensive introduction to the use of the S3 and S4 classes. I think these two chapters alone would justify the price of buying it. As far as I know, these concepts are generally not treated with the right attention in books for beginners. They may explain that when you load a package the functions you normally use, such as plot, may change their behaviour and options. However, I never found an introductory book that provides such an exhaustive explanation of how to fully control these classes to create advanced programs.

Of particular interest are also the two examples provided in chapters 10 and 11. These are practical exercises that put together all the concepts learned in the previous chapters with the purpose of creating R programs that can be easily implemented and shared. Chapter 10, for example, describes a neat and powerful way to create a new R program to grade students. In this chapter the reader will use all the basic programming concepts learned during the course of the book and put them together to create an R program that imports grades from csv files, manipulates them and creates summary statistics and plots.

In conclusion, I see a variety of uses for this book. Clearly it is targeted at post-beginners who need a short way to unlock the full power of R for their daily statistical routines. However, this book does not lose its purpose after we have learned to properly use the language. It is written in such a way that even for experienced R users it is a useful way to quickly look up functions and methods that maybe they do not use very often. I sometimes forget how to use certain functions, and having such a book on my office bookshelf will certainly help me in these frustrating situations. So I think it will become part of the set of references that future R users will use on a regular basis.

To leave a comment for the author, please follow the link and comment on his blog: R Video tutorial for Spatial Statistics.

R Package or Library

Fri, 2014-11-28 03:29

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Recently, I was talking with a colleague about an R package I developed. He used the word library several times to refer to the R package. So I realized that many R users do not know that package and library are not synonymous when referring to R.

The "Writing R Extensions" manual is clear: "A package is not a library", although the same manual admits "this is a persistent mis-usage".

What is a package

An R package is a directory of files which extend R. Some authors say that R packages are a good way to distribute R code, just as papers are a good way to disseminate scientific research. Rossi provides some good reasons to write an R package.

  • We refer to the directory containing the files as a source package, the master files of a package. This directory can be compressed into a tarball containing the files of a source package, the .tar.gz version of the source package.
  • An installed package is the result of running R CMD INSTALL or install.packages() at the R console on a source package.
  • On some platforms (notably OS X and Windows) there are also binary packages, a zip file or tarball containing the files of an installed package which can be unpacked rather than installing from sources.

Summarizing: we can refer to the source package as the human readable version of the package and to the installed package or to the binary package as the computer readable version.

What is a library

In R, a library can refer to:

  • A directory into which packages are installed, e.g. /usr/lib/R/library.
  • A shared, dynamic or static library or (especially on Windows) a DLL, where the second L stands for ‘library’. Installed packages may contain compiled code in what is known on Unix-alikes as a shared object and on Windows as a DLL.
Origin of the mis-use

The "Writing R Extensions" manual suggests that the mis-use seems to stem from S, whose analogues of R's packages were officially known as library sections and later as chapters, but almost always referred to as libraries.

I add to this that the name of the R function used to load packages, i.e. library(), doesn't help to clarify the distinction. By the way, before loading a package we have to install it with install.packages(). Then we load the package from the directory into which the package is installed, that is, a library.
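
A small illustration of the two terms in practice (the package name is chosen arbitrarily):

install.packages("ggplot2")   # installs the ggplot2 package into a library
.libPaths()                   # lists the libraries (directories) available on this system
library(ggplot2)              # loads the ggplot2 package from one of those libraries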

Summary

At the end of this post, what should a new user remember of all these?

Ramarro, a web book about advanced R programming written by Andrea Spanò, contains a useful summary.

Terms about R packages are often confused. This may help to clarify:

  • Package: a collection of R functions, data, and compiled code in a well-defined format.
  • Library: the directory where packages are installed.
  • Repository: A website providing packages for installation.
  • Source: The original version of a package with human-readable text and code.
  • Binary: A compiled version of a package with computer-readable text and code, may work only on a specific platform.

To leave a comment for the author, please follow the link and comment on his blog: MilanoR.

R, an Integrated Statistical Programming Environment and GIS

Thu, 2014-11-27 19:00

(This article was first published on Robin Lovelace - R, and kindly contributed to R-bloggers)

This article was originally published in Geoinformatics magazine.

R is well known as a powerful, extensible and relatively fast statistical programming language and open software project with a command line interface (CLI). What is less well known is that R also has cutting-edge spatial packages that allow it to behave as a fully featured Geographical Information System in the full sense of the word. In fact, some cutting-edge algorithms for image processing and spatial statistics are implemented in R before any other widely available software product (Bivand et al. 2013). Sophisticated techniques such as geographically weighted regression and spatial interaction models can be custom built around your spatial data in R. But R also works as a general purpose GIS, with mature functions for performing all established techniques of spatial analysis such as spatial selections, buffers and clipping. What is unique about R is that all these capabilities are found in a single programme: R provides a truly integrated modelling environment.

The advantages and drawbacks of R as a GIS

Despite being able to perform the same operations as dedicated GIS software such as ArcGIS and QGIS, R is fundamentally different in the way that the user interacts with it. Not only are most operations completed by typing (e.g. you type "plot(map1)" to plot the data contained in the map1 object), the visualisation stage is also different. There is no dynamic canvas which can be used to pan and zoom: instead, R only produces visual or other types of output when commanded to do so, using functions such as plot.

Table 1: A summary of the relative merits of R compared with more traditional GIS software.

Attribute | Advantages of R | Drawbacks of R
User interface | Command line interface allows rapid description of workflow and reproducibility | Steep learning curve (eased by RStudio)
Visualising data | Sophisticated and customisable graphics | No dynamic zoomable canvas
Selecting data | Concise and consistent method using square brackets (e.g. "map1[x > y,]") | Difficult to dynamically select objects from map
Manipulating data | Very wide range of functions through additional packages | Only single core processing
Analysing/modelling data | Integrated processing, analysis, and modelling framework | Sometimes more than one solution available

How to get started with spatial data in R

Here is not the place to go into the details of spatial data analysis in R. Instead, we provide a number of resources that will enable rapidly getting up to speed with both R and its spatial packages. Note that when we say "packages" for R, we are referring to specific add-ons, analogous to extensions in ArcGIS and QGIS. The range of add-ons is truly vast, as can be seen on the R website: http://cran.r-project.org/web/views/Spatial.html. This diversity can be daunting, and it is certainly frustrating when you know that a problem can be solved in several different ways: R teaches you to think about your data and analysis carefully. The most commonly used packages for spatial analysis are probably sp (the basis of spatial functionality in R), rgdal (for loading spatial file formats such as shapefiles) and rgeos (for spatial analysis). Each is installed and loaded in the same way: rgeos, for example, is installed and loaded by typing:

install.packages("rgeos")
library(rgeos)

An introductory tutorial on R as a GIS is the working paper “Introduction to visualising spatial data in R” (Lovelace and Cheshire, 2014). This document, along with sample code and data, is available free online. It contains links to many other spatial R resources, so is a good starting point for further explorations of R as a GIS.

R in action as a GIS

To demonstrate where R's flexibility comes into its own, imagine you have a large number of points that you would like to analyse. These are heavily clustered in a few areas, and you would like to a) know where these clusters are; b) remove the points in these clusters and create single points in their place, which summarise the information contained in the clustered points; and c) visualise the output.

We will not actually run through all the steps needed to do this in R. Suffice it to say that it is possible in R and very difficult to do in other GIS packages such as QGIS, ArcGIS or even PostGIS (another command line GIS that is based on the database language Postgres, a type of SQL): I was asked to tackle this problem by the Spanish GIS service SIGTE after other solutions had been tried.

All of the steps needed to solve the problem, including provision of example data, are provided online. Here I provide an overview of the processes involved and some of the key functions to provide insight into the R way of thinking about spatial data.

First the data must be loaded and converted into yet another spatial data class. This is done using the readOGR function from the rgdal package mentioned above and then using the command as(SpatialPoints(stations), "ppp") to convert the spatial object stations into the ppp class from the spatstat package.

Next the data is converted into a density raster. The value of each pixel corresponds to the interpolated density of points in that area. This is visualised using the plot and contour functions to ensure that the conversion has worked properly.
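
A minimal sketch of these loading and density steps is given below; the shapefile name and the bandwidth adjustment are assumptions for illustration rather than the exact values used in the original tutorial.

library(rgdal)     # readOGR() for reading the point data
library(spatstat)  # ppp class and kernel density estimation
library(maptools)  # coercion methods between sp and spatstat classes

stations <- readOGR(dsn = ".", layer = "stations")   # load the points (assumed file name)
sSp <- as(SpatialPoints(stations), "ppp")            # convert to spatstat's ppp class
Dens <- density(sSp, adjust = 0.2)                   # density raster (an "im" object)

plot(Dens)                  # visual check of the density surface
contour(Dens, add = TRUE)   # overlay contours to confirm the conversion worked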

The raster image is converted into smooth lines using the contourLines function. The lines, one for each cluster zone, must then be converted into polygons using the command gPolygonize(SLDF[5, ]). gPolygonize is a very useful function from the rgeos package which automates the conversion of lines into polygons.

The results are plotted using the rather lengthy set of commands shown below. This results in the figure displayed above (notice the red argument creates the red fill of the zones):

plot(Dens, main = "")
plot(lnd, border = "grey", lwd = 2, add = T)
plot(SLDF, col = terrain.colors(8), add = T)
plot(cAg, col = "red", border = "white", add = T)
graphics::text(coordinates(cAg) + 1000, labels = cAg$CODE)

Finally the points inside the cluster polygons are extracted using R’s very concise spatial subsetting syntax:

sIn <- stations[cAg, ] # select the stations inside the clusters

The code above, translated into English, means "take all the station points within the cluster polygon called cAg and save them as a new object called sIn".

Conclusion

The purpose of this article has been to introduce the idea that GIS software can take many forms, including the rather unusual command line interface of the statistical programming language R. Hopefully any preconceptions about poor performance have been dispelled by the examples of converting text into spatial objects and of clustering points. Both examples would be difficult to undertake in more traditional GIS software packages. R has a steep learning curve and should probably be seen more as a powerful tool to use in harmony with other GIS packages for dealing with particularly tricky/unconventional tasks, rather than as a standalone GIS package in its own right for most users. R interfaces to QGIS and ArcGIS should make this easier, although this solution is not yet mature. In addition to new functionalities, R should also provide a new way of thinking about spatial data for many GIS users (Lovelace and Cheshire, 2014). Yes, it has a steep learning curve, but it's a fun curve to be on whether you are at the beginning of the ascent or in the near vertical phase of the exponential function!

References

Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R (Vol. 747248717). Springer.

Lovelace, R., & Cheshire, J. (2014). Introduction to visualising spatial data in R. National Centre for Research Methods, 1403. Retrieved from http://eprints.ncrm.ac.uk/3295/

Wickham, H. (2014). Tidy data. The Journal of Statistical Software, 14(5). Retrieved from http://vita.had.co.nz/papers/tidy-data.html

To leave a comment for the author, please follow the link and comment on his blog: Robin Lovelace - R.

Le Monde puzzle [#887quater]

Thu, 2014-11-27 18:14

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

And yet another resolution of this combinatorics Le Monde mathematical puzzle: that puzzle puzzled many more people than usual! This solution is by Marco F, using a travelling salesman representation and existing TSP software.

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

For instance, take n=199. You should first calculate the "friends" and save them in a symmetric square matrix:
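
The "friends" of a number i are the values j for which i + j is a perfect square. The published solution does not show how friends was built; one possible sketch (my own, purely for illustration) is:

is_square <- function(x) sqrt(x) %% 1 == 0

# friends[[i]] holds every j (other than i) whose sum with i is a perfect square;
# with this list representation the matrix update below would read m1[i, friends[[i]]] <- 1
friends <- lapply(1:199, function(i) {
    j <- setdiff(1:199, i)
    j[is_square(i + j)]
})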

m1 <- matrix(Inf, nrow=199, ncol=199)
diag(m1) <- 0
for (i in 1:199) m1[i, friends[i]] <- 1

Export the distance matrix to a file (in TSPlib format):

library(TSP)
tsp <- TSP(m1)
tsp
image(tsp)
write_TSPLIB(tsp, "f199.TSPLIB")

And use a solver to obtain the results. The best solver for TSP is Concorde. There are online versions where you can submit jobs:

0 2 1000000
2 96 1000000
96 191 1000000
191 168 1000000
...

The numbers of the solution are in the second column (2, 96, 191, 168…). And they are 0-indexed, so you have to add 1 to them:

3 97 192 169 155 101 188 136 120 49 176 148 108 181 143 113 112 84 37 63 18 31 33 88168 193 96 160 129 127 162 199 90 79 177 147 78 22 122 167 194 130 39 157 99 190 13491 198 58 23 41 128 196 60 21 100 189 172 152 73 183 106 38 131 125 164 197 59 110 146178 111 145 80 20 61 135 121 75 6 94 195166 123 133 156 69 52 144 81 40 9 72 184 12 24 57 87 82 62 19 45 76 180 109 116 173 151 74 26 95 161 163 126 43 153 17154 27 117 139 30 70 11 89 107 118 138 186103 66 159 165 124 132 93 28 8 17 32 45 44 77 179 182 142 83 86 14 50 175 114 55 141 115 29 92 104 185 71 10 15 34 27 42 154 170 191 98 158 67 102 187 137 119 25 56 65 35 46 150 174 51 13 68 53 47 149 140 85 36 64 105 16 48

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

Eggnog for Thanksgiving

Thu, 2014-11-27 09:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's Thanksgiving Day here in the US, so we're taking the day off to spend some time with our families and to eat far too much food. If you're in the US or celebrating Thanksgiving elsewhere, enjoy the day! And for everyone in this season of joy, here's a handy app to scale your eggnog recipe, however many people are around your table. 

The R code and the Shiny app are from Hadley Wickham. Enjoy!

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

New Australian data on the HMD

Wed, 2014-11-26 21:33

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

The Human Mortality Database is a wonderful resource for anyone interested in demographic data. It is a carefully curated collection of high quality deaths and population data from 37 countries, all in a consistent format with consistent definitions. I have used it many times and never cease to be amazed at the care taken to maintain such a great resource.

The data are continually being revised and updated. Today the Australian data have been updated to 2011. There is a time lag because lagged death registrations result in undercounts, so only data that are likely to be complete are included.

Tim Riffe from the HMD has provided the following information about the update:

  1. All death counts since 1964 are now included by year of occurrence, up to 2011. We have 2012 data but do not publish them because they are likely a 5% undercount due to lagged registration.
  2. Death count inputs for 1921 to 1963 are now in single ages. Previously they were in 5-year age groups. Rather than having an open age group of 85+ in this period counts usually go up to the maximum observed (stated) age. This change (i) introduces minor heaping in early years and (ii) implies different apparent old-age mortality than before, since previously anything above 85 was modeled according to the Methods Protocol.
  3. Population denominators have been swapped out for years 1992 to the present, owing to new ABS methodology and intercensal estimates for the recent period.

Some of the data can be read into R using the hmd.mx and hmd.e0 functions from the demography package. Tim has his own package on github that provides a more extensive interface.
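
As an illustration (my own sketch, not from the post), the updated Australian data can be pulled as follows; the username and password placeholders stand for your own free HMD account details:

library(demography)

aus_mx <- hmd.mx(country = "AUS", username = "your.hmd.username",
                 password = "your.hmd.password", label = "Australia")
aus_e0 <- hmd.e0(country = "AUS", username = "your.hmd.username",
                 password = "your.hmd.password")

plot(aus_mx, series = "female")   # age-specific mortality rates, one curve per year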

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.

How About a “Snowdoop” Package?

Wed, 2014-11-26 16:40

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

Along with all the hoopla on Big Data in recent years came a lot of hype on Hadoop.  This eventually spread to the R world, with sophisticated packages being developed such as rmr to run on top of Hadoop.

Hadoop made it convenient to process data in very large distributed databases, and also convenient to create them, using the Hadoop Distributed File System.  But eventually word got out that Hadoop is slow, and very limited in available data operations.

Both of those shortcomings are addressed to a large extent by the new kid on the block, Spark, which has an R interface package, SparkR. Spark is much faster than Hadoop, sometimes dramatically so, due to strong caching ability and a wider variety of available operations. Recently distributedR has also been released, again with the goal of using R on voluminous data sets, and there is also the more established pbdR.

However, I’d like to raise a question here:  Do we really need all that complicated machinery?  I’ll propose a much simpler alternative here, and am curious to see what people think.  (Disclaimer:  I have only limited experience with Hadoop, and only a bit with SparkR.   I’ll present a proposal below, and very much want to see what others think.)

These packages ARE complicated.  There is a considerable amount of configuration to do, worsened by dependence on infrastructure software such as Java or MPI, and in some cases by interface software such as rJava.  Some of this requires systems knowledge that many R users may lack.  And once users do get these systems set up, they may be required to design algorithms with a world view quite different from R's, even though they are coding in R.

Here is a possible alternative:  simply use the familiar cluster-oriented portion of R’s parallel package, an adaptation of snow; I’ll refer to that portion of parallel as Snow, and just for fun, call the proposed package Snowdoop.  I’ll illustrate it with the “Hello world” of Hadoop, word count in a text file (slightly different from the usual example, as I’m just counting total words here, rather than the number of times each distinct word appears).

(It’s assumed here that the reader is familiar with the basics of Snow.  If not, see the first chapter of the partial rough draft of my forthcoming book.)

Say we have a data set that we have partitioned into two files, words.1 and words.2.  In my example here, they will contain the R sign-on message, with words.1 consisting of

R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale

and words.2 containing

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

Here is our code:

# give each node in the cluster cls an ID number
assignids <- function(cls) {
   clusterApply(cls,1:length(cls),
      function(i) myid <<- i)
}

# each node executes this function
getwords <- function(basename) {
   fname <- paste(basename,".",myid,sep="")
   words <- scan(fname,what="")
   length(words)
}

# manager
wordcount <- function(cls,basename) {
   assignids(cls)
   clusterExport(cls,"getwords")
   counts <- clusterCall(cls,getwords,basename)
   sum(unlist(counts))
}

# call example:
> library(parallel)
> c2 <- makeCluster(2)
> wordcount(c2,"words")
[1] 83

This couldn’t be simpler.  Yet it does what we want:

  • parallel computation on chunks of a distributed file, on independently-running nodes
  • automated “caching” (use the R <<- operator with the output of scan() above)
  • no configuration or platform worries
  • ordinary R programming, no “foreign” concepts

Indeed, it’s so simple that Snowdoop would hardly be worthy of being called a package.  It could include some routines for creating a chunked file, general file read/write routines, parallel load/save and so on, but it would still be a very small package in the end.
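For concreteness, here is a rough sketch of what one such routine might look like; the name filesplit() and its interface are purely hypothetical, not part of any existing package:

# Hypothetical helper: split a text file into nchunks roughly equal pieces
# named basename.1, basename.2, ..., one chunk per cluster node
filesplit <- function(infile, basename, nchunks) {
   lines <- readLines(infile)
   # assign each line to one of nchunks contiguous groups
   grp <- cut(seq_along(lines), nchunks, labels=FALSE)
   for (i in seq_len(nchunks)) {
      writeLines(lines[grp == i], paste(basename, i, sep="."))
   }
   invisible(basename)
}

# e.g. filesplit("bigfile.txt", "words", 2) would produce the words.1 and
# words.2 chunks that getwords() above reads independently on each node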

Granted, there is no data redundancy built in here, and we possibly lose pipelining effects, but otherwise, it seems fine.  What do you think?

To leave a comment for the author, please follow the link and comment on his blog: Mad (Data) Scientist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Confidence vs. Credibility Intervals

Wed, 2014-11-26 15:38

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Tomorrow, for the final lecture of the Mathematical Statistics course, I will try to illustrate, using Monte Carlo simulations, the difference between classical statistics and the Bayesian approach.

The (simple) way I see it is the following,

  • for frequentists, a probability is a measure of the frequency of repeated events, so the interpretation is that parameters are fixed (but unknown), and data are random
  • for Bayesians, a probability is a measure of the degree of certainty about values, so the interpretation is that parameters are random and data are fixed

Or to quote Frequentism and Bayesianism: A Python-driven Primer, a Bayesian statistician would say "given our observed data, there is a 95% probability that the true value of θ falls within the credible region", while a Frequentist statistician would say "there is a 95% probability that when I compute a confidence interval from data of this sort, the true value of θ will fall within it".

To get more intuition about those quotes, consider a simple problem with Bernoulli trials, based on insurance claims. We want to derive a confidence interval for the probability of claiming a loss. There were n = 1047 policies, and 159 claims.

Consider the standard (frequentist) confidence interval. Writing p for the observed claim frequency 159/1047, what does it mean to say that

p ± 1.96 · sqrt( p (1 − p) / n )

is the (asymptotic) 95% confidence interval? The way I see it is very simple. Let us generate some samples of size n, with the same probability as the empirical one, i.e. p (which is the meaning of "from data of this sort"). For each sample, compute the confidence interval with the relationship above. It is a 95% confidence interval because in 95% of the scenarios, the empirical value lies in the confidence interval. From a computation point of view, it is the following idea,

> xbar <- 159
> n <- 1047
> ns <- 100
> M=matrix(rbinom(n*ns,size=1,prob=xbar/n),nrow=n)

I generate ns = 100 samples of size n. For each sample, I compute the mean and the confidence interval from the previous relationship

> fIC=function(x) mean(x)+c(-1,1)*1.96*sqrt(mean(x)*(1-mean(x)))/sqrt(n)
> IC=t(apply(M,2,fIC))
> MN=apply(M,2,mean)

Then we plot all those confidence intervals, in red when they do not contain the empirical mean

> k=(xbar/n<IC[,1])|(xbar/n>IC[,2])
> plot(MN,1:ns,xlim=range(IC),axes=FALSE,
+   xlab="",ylab="",pch=19,cex=.7,
+   col=c("blue","red")[1+k])
> axis(1)
> segments(IC[,1],1:ns,IC[,2],1:ns,
+   col=c("blue","red")[1+k])
> abline(v=xbar/n)
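Incidentally, the numerical value of that asymptotic interval can be computed directly (this line is not part of the simulation above, just a quick check):

> xbar/n + c(-1,1)*1.96*sqrt((xbar/n)*(1-xbar/n)/n)

which gives an interval of roughly (0.130, 0.174) around the observed frequency of about 0.152.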

Now, what about the Bayesian credible interval? Assume that the prior distribution for the probability of claiming a loss is a Beta distribution (here a uniform Beta(1,1) prior, as in the code below). We've seen in the course that, since the Beta distribution is the conjugate of the Bernoulli one, the posterior distribution will also be Beta. More precisely, the posterior here is a Beta(1 + 159, 1 + 1047 - 159) distribution, i.e. Beta(160, 889).

Based on that property, the credible interval is obtained from quantiles of that (posterior) distribution

> u=seq(.1,.2,length=501)
> v=dbeta(u,1+xbar,1+n-xbar)
> plot(u,v,axes=FALSE,type="l")
> I=u<qbeta(.025,1+xbar,1+n-xbar)
> polygon(c(u[I],rev(u[I])),c(v[I],
+   rep(0,sum(I))),col="red",density=30,border=NA)
> I=u>qbeta(.975,1+xbar,1+n-xbar)
> polygon(c(u[I],rev(u[I])),c(v[I],
+   rep(0,sum(I))),col="red",density=30,border=NA)
> axis(1)
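The endpoints of that credible interval are simply the 2.5% and 97.5% posterior quantiles; as a quick check (again, not part of the original code):

> qbeta(c(.025,.975),1+xbar,1+n-xbar)

which gives roughly (0.131, 0.175), numerically very close to the frequentist interval here.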

What does it mean, here, to have a 95% credible interval? Well, this time, we do not draw using the empirical mean, but using possible probabilities drawn from that posterior distribution (given the observations)

> pk <- rbeta(ns,1+xbar,1+n-xbar)

In green, below, we can visualize the histogram of those values

> hist(pk,prob=TRUE,col="light green",
+   border="white",axes=FALSE,
+   main="",xlab="",ylab="",lwd=3,xlim=c(.12,.18))

And here again, let us generate samples, and compute the empirical probabilities,

> M=matrix(rbinom(n*ns,size=1,prob=rep(pk,
+   each=n)),nrow=n)
> MN=apply(M,2,mean)

Here, there is a 95% chance that those empirical means lie in the credible interval, defined using quantiles of the posterior distribution. We can actually visualize all those means: in black, the probability used to generate each sample, and then, in blue or red, the averages obtained on those simulated samples,

> abline(v=qbeta(c(.025,.975),1+xbar,
+   1+n-xbar),col="red",lty=2)
> points(pk,seq(1,40,length=ns),pch=19,cex=.7)
> k=(MN<qbeta(.025,1+xbar,1+n-xbar))|
+   (MN>qbeta(.975,1+xbar,1+n-xbar))
> points(MN,seq(1,40,length=ns),
+   pch=19,cex=.7,col=c("blue","red")[1+k])
> segments(MN,seq(1,40,length=ns),
+   pk,seq(1,40,length=ns),col="grey")

More details and examples on Bayesian statistics, seen through the eyes of a (probably) non-Bayesian statistician, can be found in my slides from my talk in London last summer.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Happy Thanksgiving | More Examples of XML + rvest with SVG

Wed, 2014-11-26 15:34

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

I did not intend for this little experiment to become a post, but I think the code builds nicely on the XML + rvest combination (also see yesterday’s post) for working with XML/HTML/SVG documents in R. It all started when I was playing on my iPhone in the Sketchbook app and drew a really bad turkey.  Even though the turkey was bad, I thought it would be fun to combine it with vivus.js.  However,

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Slightly Advanced rvest with Help from htmltools + XML + pipeR

Wed, 2014-11-26 13:16

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

Hadley Wickham’s post “rvest: easy web scraping with R” introduces the fine new package rvest very well.  For those now yearning for a slightly more advanced example with a little help from pipeR + htmltools + XML, I thought this might fill your yearn.  The code grabs CSS information by running my blog site through the fancy new site cssstats.com.  With the background colors, it makes and labels some
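The complete example is in the linked post; for anyone who has not used rvest at all yet, the basic pattern it builds on is just a few lines, sketched here with an illustrative URL and CSS selector (not taken from the post):

library(rvest)

# Read a page and pull out all the <h1> headings as a character vector.
# The URL and the "h1" selector are only illustrative.
page <- html("http://www.r-project.org/")
titles <- html_text(html_nodes(page, "h1"))
titles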

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

MilanoR meeting: 18th December

Wed, 2014-11-26 12:51

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR staff is happy to announce the next MilanoR meeting.

When

Thursday, December 18, 2014

from 6 to 8 pm

Agenda

Welcome Presentation
by Nicola Sturaro
Consultant at Quantide

Shine your Rdata: multi-source approach in media analysis for telco industry
by Giorgio Suighi (Head Of Analytics), Carlo Bonini (Data Scientist) and Paolo Della Torre (ROI Manager), MEC

The second speaker will be announced soon. If you follow R blogs or tweets, maybe you already know his/her name. Otherwise, you will have to wait until Monday. Stay connected!

Where

Fiori Oscuri Bistrot & Bar
Via Fiori Oscuri, 3 - Milano (Zona Brera)

Buffet

Our sponsors will provide the buffet after the meeting.
MilanoR meeting is sponsored by
    

MilanoR is a free event, open to all R users and enthusiasts and to those who wish to learn more about R. Places are limited, so if you would like to attend the MilanoR meeting, please register below. (If you are reading this post from a news feed, e.g. R-bloggers, please visit the original post on the MilanoR website to see the form and subscribe to the event.)


To leave a comment for the author, please follow the link and comment on his blog: MilanoR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs