R bloggers

R news and tutorials contributed by (552) R bloggers

Accessing iNaturalist data

Wed, 2014-03-26 02:00

(This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers)

The iNaturalist project is a really cool way to both engage people in citizen science and collect species occurrence data. The premise is pretty simple: users download an app for their smartphone and can then easily georeference any specimen they see, uploading it to the iNaturalist website. It lets users turn casual observations into meaningful crowdsourced species occurrence data. iNaturalist also provides a nice, robust API for accessing almost all of its data. We've developed a package, rinat, that can easily access all of that data in R. Our package spocc uses iNaturalist data as one of its sources; rinat provides an interface to all the features available in the API.

Searching

Currently you can get access to iNaturalist occurrence records from our package spocc, which works great for scenarios where you want lots of data from many sources, but rinat will get you full details on every record and offers searching on terms other than species names. First let's see how this matches what you can get with spocc.

options(stringsAsFactors = F)
library(spocc)
## Loading required package: ggplot2
library(rinat)
out <- occ(query = "Accipiter striatus", from = "inat")
inat_out <- get_inat_obs(taxon = "Accipiter striatus", maxresults = 25)
### Compare Id's and see that results are the same without viewing full tables
cbind(out$inat$data$Accipiter_striatus$Id[1:5], inat_out$Id[1:5])
##        [,1]   [,2]
## [1,] 581369 581369
## [2,] 574433 574433
## [3,] 570635 570635
## [4,] 555214 555214
## [5,] 551405 551405

The results are the same, but the rinat package offers a bit more flexibility in searching. You can search for records by a fuzzy search query, a taxon (used above in spocc), a location in a bounding box, or by date. Say you just want records of mayflies: you can use the taxon parameter to search for all lower-level taxonomic matches below the order.

may_flies <- get_inat_obs(taxon = "Ephemeroptera")
## See what species names come back.
may_flies$Species.guess[1:10]
##  [1] "Mayfly"               "Heptageniidae"        "Ephemerella subvaria"
##  [4] "Ephemerella subvaria" "Mayflies"             "Stream Mayflies"
##  [7] "Mayflies"             "Mayflies"             "Mayflies"
## [10] "Hexagenia"

You could also search using the fuzzy query parameter, looking for mentions of a specific habitat or a common name. Below I'll search for one of my favorite habitats, vernal pools, and see what species come back. We can also search for a common name and check the scientific names returned (which should all be the same).

vp_obs <- get_inat_obs(query = "vernal pool")
vp_obs$Species.guess[1:10]
##  [1] "Docks (Genus Rumex)"
##  [2] "Blennosperma bakeri"
##  [3] "Rails, Gallinules, and Coots"
##  [4] "Western Spadefoot"
##  [5] "Western Spadefoot"
##  [6] "Eupsilia"
##  [7] "upland chorus frog"
##  [8] "Wood Frog"
##  [9] "Striped Meadowhawk (Sympetrum pallipes)"
## [10] "Ambystoma maculatum"

# Now let's look up by common name:
deer <- get_inat_obs(query = "Mule Deer")
deer$Scientific.name[1:10]
##  [1] "Odocoileus hemionus" "Odocoileus hemionus" "Odocoileus hemionus"
##  [4] "Odocoileus hemionus" "Odocoileus hemionus" "Odocoileus hemionus"
##  [7] "Odocoileus hemionus" "Odocoileus hemionus" "Odocoileus"
## [10] "Odocoileus hemionus"

All of these general searching functions return a data frame that is m x 32 (where m is the requested number of results). The column names are mostly self-explanatory, including common names, species names, observer IDs, observer names, data quality, licenses, and URLs for images so you can go look at the photo a user took.
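If you want to check those columns for yourself, a quick look at the may_flies object created above does the job (a minimal sketch; the exact column set may vary between rinat versions):

# Inspect a rinat result data frame
dim(may_flies)    # m rows x 32 columns
names(may_flies)  # the column names described above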

Filtering

All searches can also be filtered by space and time. You can search for records within a specific bounding box, or on a specific date (but not a range). We can redo our deer search using a bounding box for the western United States.

bounds <- c(38.44047, -125, 40.86652, -121.837)
deer <- get_inat_obs(query = "Mule Deer", bounds = bounds)
cat(paste("The number of records found in your bounding box:", dim(deer)[1], sep = " "))
## The number of records found in your bounding box: 47

By checking the dimensions, we can see that only 47 records were found. We could try the same thing for a given day, month or year. Let's try searching for cumulative totals of observations of Ephemeroptera by month and see if we can detect seasonality.

library(ggplot2)
out <- rep(NA, 12)
for (i in 1:12) {
  out[i] <- dim(get_inat_obs(taxon = "Ephemeroptera", month = i, maxresults = 200))[1]
}
out <- data.frame(out)
out$month <- factor(month.name, levels = month.name)
ggplot(out, aes(x = month, y = out, group = 1)) + geom_point() + stat_smooth(se = FALSE) +
  xlab("Month") + ylab("Cumulative of Mayfly observations") + theme_bw(16)

Exactly as you'd expect, observations of this seasonal insect tend to peak in the summer and then slowly decline. Except for the September peak, the curve follows the expected trend.

User and project data

There are several other functions in the API that allow you to access data about projects and users. You can grab detailed data about projects, users and observations. Let's look at the EOL state flowers project. First we can grab some basic info on the project by searching for it based on its "slug", which you can find in the project URL: "http://www.inaturalist.org/projects/state-flowers-of-the-united-states-eol-collection". The slug is the section of text after "projects/", so in this case it is "state-flowers-of-the-united-states-eol-collection".

Let's grab some info on the project by getting observations, but with the type set to "info":

eol_flow <- get_inat_obs_project("state-flowers-of-the-united-states-eol-collection",
                                 type = "info", raw = FALSE)
## 204 Records
## 0
### See how many taxa there are, and how many counts there have been
cat(paste("The project has observed this many species:", eol_flow$taxa_number, sep = " "))
## The project has observed this many species: 20
cat(paste("The project has observed this many occurrences:", eol_flow$taxa_count, sep = " "))
## The project has observed this many occurrences: 204

We can also grab all the observations from the project just by setting the type to "observations". Then it's easy to get details about specific observations or users.

eol_obs <- get_inat_obs_project("state-flowers-of-the-united-states-eol-collection",
                                type = "observations", raw = FALSE)
## 204 Records
## 0-100-200-300
## See just the first few details of an observation.
head(get_inat_obs_id(eol_obs$Id[1]))
## $captive
## NULL
##
## $comments_count
## [1] 0
##
## $community_taxon_id
## [1] 48225
##
## $created_at
## [1] "2013-04-08T15:49:15-07:00"
##
## $delta
## [1] FALSE
##
## $description
## [1] ""

## See the first five species this user has recorded
head(get_inat_obs_user(as.character(eol_obs$User.login[1]), maxresults = 20))[, 1]
## [1] "Lynx rufus"              "Melanerpes formicivorus"
## [3] "Lontra canadensis"       "Buteo lineatus"
## [5] "Icteridae"               "Pelecanus occidentalis"

There are many more details that you can get, like counts of observations by place ID (extracted from the project or observation, but not well exposed to users), the most common species by date, or by user. There is almost no end to the details you can extract. If you ever wanted to do a case study of a citizen science project, you could get data to answer almost any question you had about the iNaturalist project with rinat.

Finally, what species occurrence package would be complete without some basic mapping? This function will generate a quick map for you based on a data frame of observations from rinat. These can come from functions such as get_inat_obs or get_inat_obs_project. Let's end by plotting all the observations from the EOL state flowers project.

### Set plot to FALSE so it returns a ggplot2 object, and that lets us modify it
eol_map <- inat_map(eol_obs, plot = FALSE)
### Now we can modify the returned map
eol_map + borders("state") + theme_bw() + xlim(-125, -65) + ylim(25, 50)

To leave a comment for the author, please follow the link and comment on his blog: rOpenSci Blog - R.

Categories: Methodology Blogs

RProtoBuf 0.4.1

Tue, 2014-03-25 22:37

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new bug-fix release, 0.4.1, of RProtoBuf is now on CRAN. RProtoBuf provides GNU R bindings for the Google Protocol Buffers ("Protobuf") data encoding library used and released by Google, and deployed as a language- and operating-system-agnostic protocol by numerous projects.
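For readers who have not used the package, here is a minimal sketch of what the bindings look like, based on the addressbook.proto example that ships with RProtoBuf (an illustration, not part of this release announcement):

library(RProtoBuf)                    # .proto files bundled with the package are loaded on attach
p <- new(tutorial.Person, name = "Dirk", id = 1)  # build a Protobuf message
bytes <- serialize(p, NULL)           # serialize the message to a raw vector
p2 <- tutorial.Person$read(bytes)     # parse the bytes back into a message
p2$name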

Murray once more shouldered most of the actual burden and fixed a number of issues detailed below.

Changes in RProtoBuf version 0.4.1 (2014-03-25)
  • Document and add a test for the deprecated group functionality.

  • Add a CITATION file pointing to our arXiv.org preprint.

  • Fix a bug in the show method for EnumDescriptor types.

  • Import all top-level enums from imported .proto files.

  • Removed a duplicate enum value type from the unit tests that caused problems with the most recent libprotobuf-2.5 (without option allow_alias).

CRANberries also provides a diff to the previous release. More information is at the RProtoBuf page which has a draft package vignette, a 'quick' overview vignette and a unit test summary vignette. Questions, comments etc should go to the rprotobuf mailing list off the RProtoBuf page at R-Forge.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

Categories: Methodology Blogs

R 101: Summarizing Data

Tue, 2014-03-25 22:25

(This article was first published on Mathew Analytics » R, and kindly contributed to R-bloggers)

When working with large amounts of data structured in a tabular format, a common operation is to summarize that data in different ways using specific variables. In Microsoft Excel, pivot tables are a nice feature used for this purpose. While not as “efficient” as Excel pivot tables, R provides similar functionality for summarizing large amounts of data. In the following R code, I summarize a data frame by specific variables using several different approaches.

## CREATE DATA
dat = data.frame(
  name=c("Tony","James","Sara","Alice","David","Angie","Don","Faith","Becky","Jenny",
         "Kristi","Neil","Brandon","Kara","Kendra","Liz","Gina","Amber","Alice","George"),
  state=c("KS","IA","CA","FL","MI","CO","KA","CO","KS","CA","MN","FL","NM","MS","GA",
          "IA","IL","ID","NY","NJ"),
  gender=c("M","M","F","F","F","M","F","M","F","F","F","M","M","F","F","F","F","F","F","M"),
  marital_status=c("M","S","S","S","M","M","S","M","S","M","M","S","S","S","M","M","S","M","S","M"),
  credit=c("good","good","poor","fair","poor","fair","fair","fair","good","fair",
           "good","good","poor","fair","poor","fair","fair","fair","good","fair"),
  owns_home=c(0,1,0,0,1,0,1,1,1,1,0,1,0,0,1,0,1,1,1,1),
  cost=c(500,200,300,150,200,300,400,450,250,150,500,200,300,150,200,300,400,450,250,150))

## AGGREGATE FUNCTION FROM BASE R
aggregate(cost ~ marital_status, data=dat, FUN=mean)
aggregate(cost ~ marital_status + gender, data=dat, FUN=mean)
aggregate(cost ~ marital_status + credit + gender, data=dat, FUN=mean)

## SUMMARYBY IN DOBY
library(doBy)
summaryBy(cost ~ marital_status, data=dat, FUN=c(mean, sd))
summaryBy(cost ~ gender, data=dat, FUN=c(mean, sd))
summaryBy(cost ~ credit, data=dat, FUN=c(mean, sd))

## DDPLY IN PLYR
library(plyr)
ddply(dat, .(credit), "nrow")
ddply(dat, .(credit, gender), "nrow")
ddply(dat, .(marital_status), summarise, avg=mean(cost))
ddply(dat, .(marital_status, gender), summarise, avg=mean(cost))
ddply(dat, .(marital_status, gender, credit), summarise, avg=mean(cost))

## DPLYR PACKAGE
library(dplyr)
Good = filter(dat, credit=="good")
Good
arrange(Good, desc(cost))
select(Good, owns_home, cost)
mutate(Good, New_Value=cost/5)
by.type <- group_by(Good, gender)
summarise(by.type, num.types = n(), counts = sum(cost))

## SQLDF PACKAGE
library(sqldf)
sqldf("SELECT gender, COUNT(*) FROM dat GROUP BY gender")
sqldf("SELECT gender, credit, COUNT(*) FROM dat GROUP BY gender, credit")
sqldf("SELECT gender, credit, COUNT(*), AVG(cost) FROM dat GROUP BY gender, credit")
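One grouping the snippet above does not show is dplyr's group_by() applied to the full data frame, which mirrors the aggregate() calls. A minimal sketch (not part of the original post):

## GROUPED MEANS WITH DPLYR ON THE FULL DATA FRAME (a sketch, mirroring aggregate above)
library(dplyr)
summarise(group_by(dat, marital_status, gender), avg = mean(cost))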

 


To leave a comment for the author, please follow the link and comment on his blog: Mathew Analytics » R.

Categories: Methodology Blogs

Practical Data Science with R: Release date announced

Tue, 2014-03-25 20:26

(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version (when the eBook is released), in all three formats: PDF, ePub, and Kindle.

If you haven’t yet, preorder it now!

Related posts:

  1. Data Science, Machine Learning, and Statistics: what is in a name?
  2. Setting expectations in data science projects
  3. Please stop using Excel-like formats to exchange data

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.

Categories: Methodology Blogs

Using R: quickly calculating summary statistics from a data frame

Tue, 2014-03-25 18:32

(This article was first published on There is grandeur in this view of life » R, and kindly contributed to R-bloggers)

A colleague asked: I have a lot of data in a table and I’d like to pull out some summary statistics for different subgroups. Can R do this for me quickly?

Yes, there are several pretty convenient ways. I wrote about this in the recent post on the barplot, but as this is an important part of quickly getting something useful out of R, just like importing data, I'll break it out into a post of its own. I will present a solution that uses the plyr and reshape2 packages. You can do the same with base R, and there's nothing wrong with base R, but I find that plyr and reshape2 make things convenient and easy to remember. The apply family of functions in base R does the same job as plyr, but with a slightly different interface. I strongly recommend that beginners start with plyr or the apply functions, not what I did initially, which was nested for loops and hard-bracket indexing.

We’ll go through and see what the different parts do. First, simulate some data. Again, when you do this, you usually have a table already, and you can ignore the simulation code. Usually a well formed data frame will look something this: a table where each observation is a unit such as an individual, and each column gives the data about the individual. Here, we imagine two binary predictors (sex and treatment) and two continuous response variables.

data <- data.frame(sex = c(rep(1, 1000), rep(2, 1000)),
                   treatment = rep(c(1, 2), 1000),
                   response1 = rnorm(2000, 0, 1),
                   response2 = rnorm(2000, 0, 1))
head(data)

  sex treatment   response1   response2
1   1         1 -0.15668214 -0.13663012
2   1         2 -0.40934759 -0.07220426
3   1         1  0.07103731 -2.60549018
4   1         2  0.15113270  1.81803178
5   1         1  0.30836910  0.32596016
6   1         2 -1.41891407  1.12561812

Now, calculating a function of the response in some group is straightforward. Most R functions are vectorised by default and will accept a vector (that is, a column of a data frame). The subset function lets us pull out rows from the data frame based on a logical expression using the column names. Say that we want the mean, standard deviation and a simple standard error of the mean. I will assume that we have no missing values. If you do, you can add na.rm=T to the function calls. And again, if you've got a more sophisticated model, these might not be the standard errors you want. Then pull them from the fitted model instead.

mean(subset(data, sex == 1 & treatment == 1)$response1)
sd(subset(data, sex == 1 & treatment == 1)$response1)
sd(subset(data, sex == 1 & treatment == 1)$response1)/
  sqrt(nrow(subset(data, sex == 1 & treatment == 1)))

Okay, but doing this for each combination of the predictors and responses is no fun and requires a lot of copying and pasting. Also, the above function calls are pretty messy with lots of repetition. There is a better way, and that’s where plyr and reshape2 come in. We load the packages. The first time you’ll have to run install.packages, as usual.

library(plyr)
library(reshape2)

First out, the melt function from reshape2. Look at the table above. It's reasonable in many situations, but right now, it would be better if we put both the response variables in the same column. If it doesn't seem so useful, trust me and see below. melt will take all the columns except the ones we single out as id variables and put them in the same column. It makes sense to label each row with the sex and treatment of the individual. If we had an actual unit id column, it would go here as well:

melted <- melt(data, id.vars=c("sex", "treatment"))

The resulting "melted" table looks like this. Instead of the response variables in separate columns, we get one column of values and one column indicating which variable each value comes from.

  sex treatment  variable       value
1   1         1 response1 -0.15668214
2   1         2 response1 -0.40934759
3   1         1 response1  0.07103731
4   1         2 response1  0.15113270
5   1         1 response1  0.30836910
6   1         2 response1 -1.41891407

Now it’s time to calculate the summary statistics again. We will use the same functions as above to do the actual calculations, but we’ll use plyr to automatically apply them to all the subsets we’re interested in. This is sometimes called the split-apply-combine approach: plyr will split the data frame into subsets, apply the function of our choice, and then collect the results for us. The first thing to notice is the function name. All the main plyr functions are called something with -ply. The letters stand for the input and return data type: ddply works on a data frame and returns a data frame. It’s probably the most important member of the family.

The arguments to ddply are the data frame to work on (melted), a vector of the column names to split on, and a function. The arguments after the function name are passed on to the function. Here we want to split in subsets for each sex, treatment and response variable. The function we apply is summarise, which makes a new data frame with named columns based on formulas, allowing us to use the column names of the input data frame in formulas. In effect it does exactly what the name says, summarises a data frame. And in this instance, we want to calculate the mean, standard deviation and standard error of the mean, so we use the above function calls, using value as the input. Run the ddply call, and we’re done!

ddply(melted, c("sex", "treatment", "variable"), summarise,
      mean = mean(value), sd = sd(value),
      sem = sd(value)/sqrt(length(value)))

  sex treatment  variable         mean        sd        sem
1   1         1 response1  0.021856280 1.0124371 0.04527757
2   1         1 response2  0.045928150 1.0151670 0.04539965
3   1         2 response1 -0.065017971 0.9825428 0.04394065
4   1         2 response2  0.011512867 0.9463053 0.04232006
5   2         1 response1 -0.005374208 1.0095468 0.04514830
6   2         1 response2 -0.051699624 1.0154782 0.04541357
7   2         2 response1  0.046622111 0.9848043 0.04404179
8   2         2 response2 -0.055257295 1.0134786 0.04532414
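For completeness, roughly the same table can be produced in base R with aggregate() on the melted data (a sketch, not part of the original walkthrough; the output layout differs slightly from ddply's):

aggregate(value ~ sex + treatment + variable, data = melted,
          FUN = function(x) c(mean = mean(x), sd = sd(x),
                              sem = sd(x)/sqrt(length(x))))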

Posted in: computer stuff, data analysis, english. Tagged: #blogg100, plyr, R, reshape2

To leave a comment for the author, please follow the link and comment on his blog: There is grandeur in this view of life » R.

Categories: Methodology Blogs

A Thumbnail History of Ensemble Methods

Tue, 2014-03-25 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

By Mike Bowles

Ensemble methods are the backbone of machine learning techniques. However, it can be a daunting subject for someone approaching it for the first time, so we asked Mike Bowles, machine learning expert and serial entrepreneur to provide some context.

Ensemble Methods are among the most powerful and easiest to use of predictive analytics algorithms, and the R programming language has an outstanding collection that includes the best performers – Random Forest, Gradient Boosting and Bagging – as well as big data versions that are available through Revolution Analytics.

The phrase “Ensemble Methods” generally refers to building a large number of somewhat independent predictive models and then combining them by voting or averaging to yield very high performance.  Ensemble methods have been called crowd sourcing for machines.  Bagging, Boosting and Random Forest all have the objective of improving performance beyond what’s achievable with a binary decision tree, but the algorithms take different approaches to improving performance. 

Bagging and Random Forests were developed to overcome variance and stability issues with binary decision trees.  The term “Bagging” was coined by the late Professor Leo Breiman of Berkeley.  Professor Breiman was instrumental in the development of decision trees for statistical learning and recognized that training and averaging a multitude of trees on different random subsets of data would reduce variance and improve stability.  The term comes from a shortening of “Bootstrap Aggregating” and the relation to bootstrap sampling is obvious. 

Tin Kam Ho of Bell Labs developed Random Decision Forests as an example of a random subspace method.  The idea with Random Decision Forests was to train binary decision trees on random subsets of attributes (random subsets of columns of the training data).  Breiman and Cutler’s Random Forests method combined random subsampling of rows (Bagging) with random subsampling of columns.  The randomForest package in R was written by Professor Breiman and Adele Cutler. 

Boosting methods grew out of work on computational learning theory. The first algorithm of this type was called AdaBoost by its authors, Freund and Schapire. In the introduction to their paper they use the example of friends who go to the race track regularly and bet on the horses. One of the friends decides to devise a method of betting a fraction of his money with each of his friends and adjusting the fractions based on results, so that his performance over time approaches the performance of his most winning friend. The goal with Boosting is maximum predictive performance.

AdaBoost stood for a long time as the best example of a black-box algorithm. A practitioner could apply it without much parameter tweaking and it would yield superior performance while almost never overfitting. It was a little mysterious. In some of Professor Breiman's papers on Random Forests, he compares performance with AdaBoost. Professor Jerome Friedman and his Stanford colleagues, Professors Hastie and Tibshirani, authored a paper in 2000 that attempted to understand why AdaBoost was so successful. The paper caused a storm of controversy. The comments on the paper were longer than the paper itself. Most of the comments centered on whether boosting was just another way of reducing variance or was doing something different by focusing on error reduction. Professor Friedman offered several arguments and examples to demonstrate that boosting is more than just another variance reduction technique, but commenters did not reach a consensus.

The understanding that Professor Friedman and his colleagues developed from analyzing AdaBoost led him to formulate the boosting method more directly. That led to a number of valuable extensions and improvements beyond AdaBoost – the ability to handle regression and multiclass problems, other performance measures besides squared error, etc. These features (and new ones being developed) are all included in the excellent R package gbm by Greg Ridgeway.
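As a concrete pointer to the two packages mentioned above, here is a minimal sketch of fitting each one on the built-in iris data (illustrative only, not part of the original article):

library(randomForest)
library(gbm)
# Random Forest: classify iris species with 500 bagged, feature-subsampled trees
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
# Gradient boosting: regress sepal length on the other flower measurements
boost <- gbm(Sepal.Length ~ ., data = iris[, 1:4], distribution = "gaussian",
             n.trees = 1000, interaction.depth = 3, shrinkage = 0.01)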

Today, ensemble methods form the backbone of many data science applications. Random Forests has become particularly popular with modelers competing in Kaggle competitions, and according to Google Trends, Random Forests has surpassed AdaBoost in popularity.

In a future post we will explore several of these algorithms in R.

References

Breiman, Leo. Bagging Predictors. Technical Report No. 421, Sept 1994, Dept. of Statistics, University of California, Berkeley.

Ho, T. Random Decision Forests. Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278-282, 1995.

Breiman, Leo. Random Forest – Random Features. Technical Report No. 567, Sept 1999, Dept. of Statistics, University of California, Berkeley.

Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1997.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive Logistic Regression: A Statistical View of Boosting. Annals of Statistics, Vol. 28, No. 2 (2000), 337-655.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

Categories: Methodology Blogs

Interactive Discovery of Research Affiliates JoPM Paper

Tue, 2014-03-25 11:21

(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)

In my previous post, More on Rebalancing | With Data from Research Affiliates, I did some really basic visualizations, but I thought this data would be great for some more powerful interactive discovery using objeq, an interesting JavaScript SQL-like query language, along with the d3.js charting library dimple.js. Next, I hope to extend it to use lodash or lazy.js. This exercise helps me think

To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.

Categories: Methodology Blogs

Filtering Data with L2 Regularisation

Tue, 2014-03-25 06:58

(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

I have just finished reading Momentum Strategies with L1 Filter by Tung-Lam Dao. The smoothing results presented in this paper are interesting and I thought it would be cool to implement the L1 and L2 filtering schemes in R. We’ll start with the L2 scheme here because it has an exact solution and I will follow up with the L1 scheme later on.

Formulation of the Problem

Consider some time series data consisting of points \(y_t = x_t + \epsilon_t\), where \(x_t\) is a smooth signal, \(\epsilon_t\) is noise and \(y_t\) is the combined noisy signal. If we have observations of \(y_t\), how can we get back to an estimate of \(x_t\)?

The Hodrick-Prescott filter works by minimising the objective function

$$
\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} (x_{t-1} - 2 x_t + x_{t+1})^2.
$$

The regularisation parameter, \(\lambda\), balances the contributions of the first and second summations, where the first is the sum of squared residuals and the second is the sum of squared curvatures in the filtered signal (characterised by the central difference approximation for the second derivative). A small value for \(\lambda\) causes the residuals to dominate the optimisation problem. A large value for \(\lambda\) will result in a solution which minimises curvature.

Implementation and Test Data

Implementing a function to perform the optimisation is pretty simple.

> l2filter.optim <- function(x, lambda = 0.0) {
+   objective <- function(y, lambda) {
+     n <- length(x)
+     P1 = 0.5 * sum((y - x)**2)
+     #
+     P2 = 0
+     for (i in 2:(n-1)) {
+       P2 = P2 + (y[i-1] - 2 * y[i] + y[i+1])**2
+     }
+     #
+     P1 + lambda * P2
+   }
+   #
+   optim(x, objective, lambda = lambda, method = "BFGS")$par
+ }

It has a nested objective function. The BFGS method is specified for optim() because the Nelder and Mead optimisation scheme converged too slowly.

First we’ll try this out on some test data.

> N <- 20
>
> set.seed(1)
>
> (y <- 1:N + 10 * runif(N))
 [1]  3.6551  5.7212  8.7285 13.0821  7.0168 14.9839 16.4468 14.6080 15.2911 10.6179 13.0597
[12] 13.7656 19.8702 17.8410 22.6984 20.9770 24.1762 27.9191 22.8004 27.7745

If we use \(\lambda = 0\) then regularisation has no effect and the objective function is minimised when \(x_t = y_t\). Not surprisingly, in this case the filtered signal is the same as the original signal.

> l2filter.optim(y, 0)
 [1]  3.6551  5.7212  8.7285 13.0821  7.0168 14.9839 16.4468 14.6080 15.2911 10.6179 13.0597
[12] 13.7656 19.8702 17.8410 22.6984 20.9770 24.1762 27.9191 22.8004 27.7745

If, on the other hand, we use a large value for the regularisation parameter then the filtered signal is significantly different.

> l2filter.optim(y, 100)
 [1]  5.8563  7.0126  8.1579  9.2747 10.3484 11.3835 12.3677 13.3067 14.2269 15.1607 16.1463
[12] 17.1989 18.3183 19.4873 20.6963 21.9274 23.1729 24.4203 25.6621 26.9082

A plot is the most sensible way to visualise the effects of \(\lambda\). Below the original data (circles) are plotted along with the filtered data for values of \(\lambda\) from 0.1 to 100. In the top panel, weak regularisation results in a filtered signal which is not too different from the original. At the other extreme, the bottom panel shows strong regularisation, where the filtered signal is essentially a straight line (all curvature has been removed). The other two panels represent intermediate levels of regularisation, and it is clear how the original signal is being smoothed to varying degrees.

Matrix Implementation

As it happens there is an exact solution to the Hodrick-Prescott optimisation problem, which involves some simple matrix algebra. The core of the solution is a band matrix with a right bandwidth of 2. The non-zero elements on each row are 1, -2 and 1. The function below constructs this matrix in a rather naive way. However, it is simply for illustration: we will look at a better implementation using sparse matrices.
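For reference, setting the gradient of the objective above to zero gives the closed-form solution that both implementations below compute (implied by, rather than stated in, the original post):

$$
\hat{x} = (I + 2 \lambda D^T D)^{-1} y,
$$

where \(D\) is the \((n-2) \times n\) second-difference band matrix described above.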

l2filter.matrix <- function(x, lambda = 0.0) {
  n <- length(x)
  I = diag(1, nrow = n)
  D = matrix(0, nrow = n - 2, ncol = n)
  #
  for (i in 1:(n-2)) {
    D[i, i:(i+2)] = c(1, -2, 1)
  }
  c(solve(I + 2 * lambda * t(D) %*% D) %*% x)
}

Applying this function to the same set of test data, we get results consistent with those from optimisation.

> l2filter.matrix(y, 100)
 [1]  5.8563  7.0126  8.1579  9.2747 10.3484 11.3835 12.3677 13.3067 14.2269 15.1607 16.1463
[12] 17.1989 18.3183 19.4873 20.6963 21.9274 23.1729 24.4203 25.6621 26.9082

In principle the matrix solution is much more efficient than the optimisation. However, an implementation using a dense matrix (as above) would not be feasible for a data series of any appreciable length due to memory constraints. A sparse matrix implementation does the trick though.

library(Matrix)
l2filter.sparse <- function(x, lambda = 0.0) {
  n <- length(x)
  I = Diagonal(n)
  D = bandSparse(n = n - 2, m = n, k = c(0, 1, 2),
                 diagonals = list(rep(1, n), rep(-2, n), rep(1, n)))
  (solve(I + 2 * lambda * t(D) %*% D) %*% x)[,1]
}

Again we can check that this gives the right results.

> l2filter.sparse(y, 100)
 [1]  5.8563  7.0126  8.1579  9.2747 10.3484 11.3835 12.3677 13.3067 14.2269 15.1607 16.1463
[12] 17.1989 18.3183 19.4873 20.6963 21.9274 23.1729 24.4203 25.6621 26.9082

Application: S&P500 Data

So, let’s take this out for a drive in the real world. We’ll get out hands on some S&P500 data from Quandl.

> library(Quandl)
>
> Quandl.auth("xxxxxxxxxxxxxxxxxxxx")
>
> SP500 = Quandl("YAHOO/INDEX_GSPC", start_date = "1994-01-01", end_date = "2014-01-01")
> #
> SP500 = SP500[, c(1, 5)]

Then systematically apply the filter with a range of regularisation parameters scaling from 0.1 to 100000 in multiples of 10. The results are plotted below. In each panel the grey data reflect the raw daily values for the S&P500 index. Superimposed on top of these are the results of filtering the signal using the specified regularisation parameter. As anticipated, larger values of \(\lambda\) result in a smoother curve since the filter is more heavily penalised for curvature in the filtered signal.
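The sweep itself is straightforward. Here is a minimal sketch of the loop described above (not the author's original plotting code; it assumes the second column of SP500 holds the closing values selected earlier):

# Apply the sparse filter for lambda = 0.1, 1, 10, ..., 100000
lambdas  <- 10^seq(-1, 5)
filtered <- sapply(lambdas, function(l) l2filter.sparse(SP500[, 2], l))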

I think that the results look rather compelling. The only major drawback to this filter seems to be the fact that, if used dynamically, the algorithm can (and most likely will) cause previous states to change. If used, for example, as the basis for an indicator on a chart, this would cause repainting of historical values.

To leave a comment for the author, please follow the link and comment on his blog: Exegetic Analytics » R.

Categories: Methodology Blogs

Sankey diagrams with googleVis

Tue, 2014-03-25 02:56

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Sankey diagrams are great for visualising flows from one set of data values to another. Although named after Irish Captain Matthew Henry Phineas Riall Sankey, who used this type of diagram in 1898 to show the energy efficiency of a steam engine, the best known Sankey diagram is probably Charles Minard's Map of Napoleon's Russian Campaign of 1812, which he actually produced in 1869.

Thomas Rahlf: Datendesign mit R
The above example, from Thomas Rahlf's book Datendesign mit R, shows that Minard's plot can be reproduced with base graphics in R. Aaron Berdanier posted the SankeyR function in 2010, and Erik Andrulis published the riverplot package on CRAN, which also lets users create static Sankey charts.

Interactive Sankey diagrams can be generated with rCharts and now also with googleVis (version >= 0.5.0). For my first example I use UK visitor data from VisitBritain.org. The following diagram visualises the flow of visitors in 2012: where they came from and which parts of the UK they visited. This example already illustrates the key concept: I need a data frame with three columns that describes the flow of data from a source to a target and the strength or weight of the connection.
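To make the three-column idea concrete, here is a minimal sketch of the kind of data frame and gvisSankey() call involved (the visitor numbers are made up for illustration, not the VisitBritain.org figures used for the chart below):

library(googleVis)  # requires version >= 0.5.0
dat <- data.frame(origin      = c("France", "Germany", "USA", "France", "USA"),
                  destination = c("London", "London", "Scotland", "Wales", "London"),
                  visits      = c(20, 15, 5, 3, 12))
sk <- gvisSankey(dat, from = "origin", to = "destination", weight = "visits")
plot(sk)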

[Interactive Sankey diagram: 2012 UK visitor flows]


My next example uses a graph data set that I visualise in the same way again, but here I start to play around with the various parameters of the Google API.

[Interactive Sankey diagram: graph data set with modified chart options]


As stated by Google, the Sankey chart may be undergoing substantial revisions in future Google Charts releases.

For more information and installation instructions see the googleVis project site and Google documentation.

Session Info

R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] googleVis_0.5.0-4 igraph_0.7.0

loaded via a namespace (and not attached):
[1] RJSONIO_1.0-3 tools_3.0.3

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

Categories: Methodology Blogs

Free eBook on Big Data and Data Science

Mon, 2014-03-24 17:15

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The fine folks behind the Big Data Journal have just published a new e-book Big Data: Harnessing the Power of Big Data Through Education and Data-Driven Decision Making. (Note: Adobe Flash is required to view the e-book.) In the eBook, you'll find the following technical papers on the topics of Big Data, Data Science, and R:

  • Data Science and its Relationship to Big Data and Data-Driven Decision Making, by Foster Provost and Tom Fawcett. 
  • Predictive Modeling With Big Data: Is Bigger Really Better?, by Enric Junqué de Fortuny, David Martens, and Foster Provost. 
  • Educating the Next Generation of Data Scientists, a roundtable discussion including Edd Dumbill, Elizabeth D. Liddy, Jeffrey Stanton, Kate Mueller, and Shelly Farnham
  • Delivering Value from Big Data with Revolution R Enterprise and Hadoop, by Thomas Dinsmore and Bill Jacobs.

There's also a video introduction to the papers from yours truly. View the e-book — sponsored by Revolution Analytics — here: Big Data: Harnessing the Power of Big Data Through Education and Data-Driven Decision Making

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

Categories: Methodology Blogs

A Note on the Johnson-Lindenstrauss Lemma

Mon, 2014-03-24 11:55

(This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers)

Introduction

A recent thread on Theoretical CS StackExchange comparing the Johnson-Lindenstrauss Lemma with the Singular Value Decomposition piqued my interest enough that I decided to spend some time last night reading the standard JL papers. Until this week, I only had a vague understanding of what the JL Lemma implied. I previously mistook the JL Lemma for a purely theoretical result that established the existence of distance-preserving projections from high-dimensional spaces into low-dimensional spaces.

This vague understanding of the JL Lemma turns out to be almost correct, but it also led me to neglect the most interesting elements of the literature on the JL Lemma: the papers on the JL Lemma do not simply establish the existence of such projections, but also provide (1) an explicit bound on the dimensionality required for a projection to ensure that it will approximately preserve distances and they even provide (2) an explicit construction of a random matrix, \(A\), that produces the desired projection.

Once I knew that the JL Lemma was a constructive proof, I decided to implement code in Julia to construct examples of this family of random projections. The rest of this post walks through that code as a way of explaining the JL Lemma’s practical applications.

Formal Statement of the JL Lemma

The JL Lemma, as stated in “An elementary proof of the Johnson-Lindenstrauss Lemma” by Dasgputa and Gupta, is the following result about dimensionality reduction:

For any \(0 < \epsilon < 1\) and any integer \(n\), let \(k\) be a positive integer such that \(k \geq 4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)\).

Then for any set \(V\) of \(n\) points in \(\mathbb{R}^d\), there is a map \(f : \mathbb{R}^d \to \mathbb{R}^k\) such that for all \(u, v \in V\),

$$
(1 - \epsilon) ||u - v||^2 \leq ||f(u) - f(v)||^2 \leq (1 + \epsilon) ||u - v||^2.
$$

Further this map can be found in randomized polynomial time.

To fully appreciate this result, we can unpack the abstract statement of the lemma into two components.

The JL Lemma in Two Parts

Part 1: Given a number of data points, \(n\), that we wish to project and a relative error, \(\epsilon\), that we are willing to tolerate, we can compute a minimum dimensionality, \(k\), that a projection must map a space into before it can guarantee that distances will be preserved up to a factor of \(\epsilon\).

In particular, \(k = \left \lceil{4(\epsilon^2/2 - \epsilon^3/3)^{-1}\log(n)} \right \rceil\).

Note that this implies that the dimensionality required to preserve distances depends only on the number of points and not on the dimensionality of the original space.

Part 2: Given an input matrix, \(X\), of \(n\) points in \(d\)-dimensional space, we can explicitly construct a map, \(f\), such that the distance between any pair of columns of \(X\) will not distorted by more than a factor of \(\epsilon\).

Surprisingly, this map \(f\) can be a simple matrix, \(A\), constructed by sampling \(k * d\) IID draws from a Gaussian with mean \(0\) and variance \(\frac{1}{k}\).

Coding Up The Projections

We can translate the first part of the JL Lemma into a single line of code that computes the dimensionality, \(k\), of our low-dimensional space given the number of data points, \(n\), and the error, \(\epsilon\), that we are willing to tolerate:

mindim(n::Integer, ε::Real) = iceil((4 * log(n)) / (ε^2 / 2 - ε^3 / 3))

Having defined this function, we can try it out on a simple problem:

mindim(3, 0.1)
# => 942

This result was somewhat surprising to me: to represent \(3\) points with no more than \(10\)% error, we require nearly \(1,000\) dimensions. This reflects an important fact about the JL Lemma: it produces results that can be extremely conservative for low-dimensional inputs. It's obvious that, for data sets that contain \(3\) points in \(100\)-dimensional space, we could use a projection into \(100\) dimensions that would preserve distances perfectly.

But this observation neglects one of the essential aspects of the JL Lemma: the dimensions required by the lemma will be sufficient whether our data set contains points in \(100\)-dimensional space or points in \(10^{100}\)-dimensional space. No matter what dimensionality the raw data lies in, the JL Lemma says that \(942\) dimensions suffices to preserve the distances between \(3\) points.

I found this statement unintuitive at the start. To see that it’s true, let’s construct a random projection matrix, \(A\), that will let us confirm experimentally that the JL Lemma really works:

using Distributions

function projection(
    X::Matrix,
    ε::Real,
    k::Integer = mindim(size(X, 2), ε)
)
    d, n = size(X)
    A = rand(Normal(0, 1 / sqrt(k)), k, d)
    return A, k, A * X
end

This projection function is sufficient to construct a matrix, \(A\), that will satisfy the assumptions of the JL Lemma. It will also return the dimensionality, \(k\), of \(A\) and the result of projecting the input, \(X\), into the new space defined by \(A\). To get a feel for how this works, we can try this out on a very simple data set:

X = eye(3, 3)

ε = 0.1

A, k, AX = projection(X, ε)
# =>
# (
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959
#  -0.00501367   0.0316806    0.0460191
#   0.0633815   -0.0136478   -0.0198676
#   0.0262627    0.00187459  -0.0122604
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437
#   0.00685299  -0.0513301    0.0501431
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157
#   0.0612926    0.0276185    0.0271352
#   ⋮
#  -0.00167347  -0.018576     0.0290964
#   0.0158393    0.0124403   -0.0208216
#  -0.00833401   0.0323784    0.0245698
#   0.019355     0.0057538    0.0150561
#   0.00352774   0.031572    -0.0262811
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289
#   0.0106868    0.0341909    0.0116523
#   0.0072586   -0.0337501    0.0405171 ,
#
# 942,
# 942x3 Array{Float64,2}:
#  -0.035269    -0.0299966   -0.0292959
#  -0.00501367   0.0316806    0.0460191
#   0.0633815   -0.0136478   -0.0198676
#   0.0262627    0.00187459  -0.0122604
#   0.0417169   -0.0230222   -0.00842476
#   0.0236389    0.0585979   -0.0642437
#   0.00685299  -0.0513301    0.0501431
#   0.027723    -0.0151694    0.00274466
#   0.0338992    0.0216184   -0.0494157
#   0.0612926    0.0276185    0.0271352
#   ⋮
#  -0.00167347  -0.018576     0.0290964
#   0.0158393    0.0124403   -0.0208216
#  -0.00833401   0.0323784    0.0245698
#   0.019355     0.0057538    0.0150561
#   0.00352774   0.031572    -0.0262811
#  -0.0523636   -0.0388993   -0.00794319
#  -0.0363795    0.0633939   -0.0292289
#   0.0106868    0.0341909    0.0116523
#   0.0072586   -0.0337501    0.0405171 )

According to the JL Lemma, the new matrix, \(AX\), should approximately preserve the distances between columns of \(X\). We can write a quick function that verifies this claim:

function ispreserved(X::Matrix, A::Matrix, ε::Real)
    d, n = size(X)
    k = size(A, 1)

    for i in 1:n
        for j in (i + 1):n
            u, v = X[:, i], X[:, j]
            d_old = norm(u - v)^2
            d_new = norm(A * u - A * v)^2
            @printf("Considering the pair X[:, %d], X[:, %d]...\n", i, j)
            @printf("\tOld distance: %f\n", d_old)
            @printf("\tNew distance: %f\n", d_new)
            @printf(
                "\tWithin bounds %f <= %f <= %f\n",
                (1 - ε) * d_old,
                d_new,
                (1 + ε) * d_old
            )
            # Check that the projected distance lies within the JL bounds
            if !((1 - ε) * d_old <= d_new <= (1 + ε) * d_old)
                return false
            end
        end
    end

    return true
end

And then we can test out the results:

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#   Old distance: 2.000000
#   New distance: 2.104506
#   Within bounds 1.800000 <= 2.104506 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#   Old distance: 2.000000
#   New distance: 2.006130
#   Within bounds 1.800000 <= 2.006130 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#   Old distance: 2.000000
#   New distance: 1.955495
#   Within bounds 1.800000 <= 1.955495 <= 2.200000

As claimed, the distances are indeed preserved up to a factor of \(\epsilon\). But, as we noted earlier, the JL lemma has a somewhat perverse consequence for our \(3×3\) matrix: we’ve expanded our input into a \(942×3\) matrix rather than reduced its dimensionality.

To get meaningful dimensionality reduction, we need to project a data set from a space that has more than \(942\) dimensions. So let’s try out a \(50,000\)-dimensional example:

X = eye(50_000, 3)

A, k, AX = projection(X, ε)

ispreserved(X, A, ε)
# =>
# Considering the pair X[:, 1], X[:, 2]...
#   Old distance: 2.000000
#   New distance: 2.021298
#   Within bounds 1.800000 <= 2.021298 <= 2.200000
# Considering the pair X[:, 1], X[:, 3]...
#   Old distance: 2.000000
#   New distance: 1.955502
#   Within bounds 1.800000 <= 1.955502 <= 2.200000
# Considering the pair X[:, 2], X[:, 3]...
#   Old distance: 2.000000
#   New distance: 1.988945
#   Within bounds 1.800000 <= 1.988945 <= 2.200000

In this case, the JL Lemma again works as claimed: the pairwise distances between columns of \(X\) are preserved. And we’ve done this while reducing the dimensionality of our data from \(50,000\) to \(942\). Moreover, this same approach would still work if the input space had \(10\) million dimensions.
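Since this appears on R-bloggers, here is the same construction translated into R (a sketch, not part of the original post; the names mirror the Julia code above):

# R sketch of the JL projection: a k x d Gaussian matrix with mean 0 and variance 1/k
mindim <- function(n, eps) ceiling((4 * log(n)) / (eps^2 / 2 - eps^3 / 3))

jl_projection <- function(X, eps, k = mindim(ncol(X), eps)) {
  d <- nrow(X)
  A <- matrix(rnorm(k * d, mean = 0, sd = 1 / sqrt(k)), nrow = k, ncol = d)
  list(A = A, k = k, AX = A %*% X)
}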

Conclusion

Contrary to my naive conception of the JL Lemma, the literature on the lemma not only tells us that, abstractly, distances can be preserved by dimensionality reduction techniques. It also tells us how to perform this reduction — and the mechanism is both simple and general.

To leave a comment for the author, please follow the link and comment on his blog: John Myles White » Statistics.

Categories: Methodology Blogs

Hack, a template for improving code reliability

Mon, 2014-03-24 11:20

(This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

My sole prediction for 2014 has come true: Facebook have announced the Hack language (if you don't know that HHVM is the Hip Hop Virtual Machine, you are obviously not a trendy developer).

This language does not follow the usual trend in that it looks useful, rather than being fashion fluff for corporate developers to brag about. Hack extends an existing language (don’t the Facebook developers know about not-invented-here?) by adding features to improve code reliability (how uncool is that) and stuff that will sometimes enable faster code to be generated (which has always been cool).

Well done Facebook. I hope this is the start of a trend of adding features to a language that help developers improve code reliability.

Hack extends PHP to allow programmers to express intent, e.g., this variable only ever holds integer values. Compilers can then check that the code follows the intent and flag when it doesn’t, e.g., a string is assigned to the variable intended to only hold integers. This sounds so trivial to be hardly worth bothering about, but in practice it catches lots of minor mistakes very quickly and saves huge amounts of time that would otherwise be spent debugging code at runtime.

Yes, Hack has added static typing to a dynamically typed language. There is a generally held view that static typing prevents programmers from doing what needs to be done and that dynamic typing is all about freedom of expression (who could object to that?). Static typing got a bad name because early languages using it were too disciplinarian in a few places, and like the very small stone in a runner's shoe these edge cases came to dominate thinking. Dynamic languages are great for small programs and showing off to spotty teenage students, but are expensive to maintain and a nightmare to work with on 10K+ line systems.

The term gradual typing is a good description of Hack's type system. Developers can take existing PHP code and gradually give types to existing variables in a piecemeal fashion, or add new code that uses types into code that does not. The type checker figures out what it can and does not get too uppity about complaining. If a developer can be talked into giving such a system a try, they quickly learn that they can save a lot of debugging time by using it.

I would like to see gradual typing introduced into R, but perhaps the language does not cause its users enough grief to make this happen (it is R’s libraries that cause the grief):

  • Compared to PHP’s quirks the R quirk’s are pedestrian. In the interest of balance I should point out that Javascript can at times be as quirky as PHP and C++ error messages can be totally incomprehensible to everybody (including the people who wrote the compiler).
  • R programs are often small, i.e., 100 lines-ish. It is only when programs written in dynamically typed languages start to exceed around 10k+ lines that they begin to fall in on themselves, unless that one person who has everything in his head is there to hold it all up.

However, there is a sort of precedent: Perl programs tend to be short (although I don't think they are as short as R programs) and Perl gradually introduced the option of stronger typing. But Perl did/does have one person who was the recognized language designer and could lead the process; R has a committee.

To leave a comment for the author, please follow the link and comment on his blog: The Shape of Code » R.

Categories: Methodology Blogs

Image CGI with R

Mon, 2014-03-24 11:15

(This article was first published on BioStatMatt » R, and kindly contributed to R-bloggers)

I received an email recently with a question about using R in the common gateway interface (CGI) framework to create and pass image data to the web browser. I posted to StackOverflow about this some time ago, but had forgotten the details. The trouble is that R's graphics devices only write image data to file, rather than to a buffer or R connection. Hence, when passing image data via CGI, it is necessary to write an intermediate image file, read the file, and then output the data to STDOUT. Unfortunately, R will not allow you to write binary data to STDOUT (for example, try writeBin(as.raw("0xFF"), stdout())). So, this becomes even more complicated! One solution is to open a (binary) pipe connection to the Linux cat command, which sends its output to STDOUT, and write the image data to the pipe connection. Of course, this is a platform-dependent solution. Below is a CGI script that implements this:

#!/bin/sh
REXEC="/usr/local/bin/R --vanilla --slave"
$REXEC <<EOF
# create a random file name
pngfile <- paste0(format(Sys.time(), "%Y%m%d%H%M%S"),
                  paste(sample(letters,10), collapse=""), ".png")
# create temporary graphic file
png(pngfile, type = "cairo")
x <- seq(-10, 10, length= 30)
y <- x
f <- function(x, y) { r <- sqrt(x^2+y^2); 10 * sin(r)/r }
z <- outer(x, y, f)
z[is.na(z)] <- 1
op <- par(bg = rgb(1,1,1,0))
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue")
persp(x, y, z, theta = 30, phi = 30, expand = 0.5, col = "lightblue",
      ltheta = 120, shade = 0.75, ticktype = "detailed",
      xlab = "X", ylab = "Y", zlab = "Sinc( r )")
invisible(dev.off())
# write headers
pngsize <- file.info(pngfile)[["size"]]
cat("Content-type: image/png\n")
cat(paste("Content-length: ", pngsize, "\n\n", sep=""))
# open pipe to stdout and pass image data
con <- pipe("cat", "wb")
writeBin(readBin(pngfile, 'raw', n=pngsize), con)
flush(con)
close(con)
# remove intermediate graphic
invisible(file.remove(pngfile))
EOF
###

To leave a comment for the author, please follow the link and comment on his blog: BioStatMatt » R.

Categories: Methodology Blogs

R jobs (March 24th)

Mon, 2014-03-24 07:42

Earlier this month I announced the launch of R-users.com, a new R jobs website. My intention is to post around twice a month about new R jobs posted on the website.

You are invited to use this link to post new R jobs. Registration takes less than 10 seconds.

Here are the new R jobs:

 

  1. Full-Time
    Data Scientist, Schroeder Institute
    American Legacy Foundation 
    Washington, District of Columbia, United States
    21 Mar 2014
  2. Freelance
    Senior Algorithm Data Scientist
    MyPermissions 
    Tel Aviv, Tel Aviv District, Israel
    18 Mar 2014
  3. Freelance
    Software Developer
    Martina Morris / University of Washington 
    Anywhere
    18 Mar 2014
  4. Full-Time
    Fully-funded PhD Studentship in Joint Modelling at Northumbria University
    Northumbria University
    Newcastle upon Tyne, England, United Kingdom
    14 Mar 2014
  5. Part-Time
    Part-time Technical Writer/Editorial Assistant
    Vivian Zhang/SupStat Inc 
    New York, New York, United States
    7 Mar 2014
  6. Full-Time
    R package developer
    University of Washington, statnet research group 
    Anywhere
    7 Mar 2014
  7. Internship
    Intern/Researcher in Aspinalls Group
    Aspinalls Group 
    England, United Kingdom
    4 Mar 2014
Categories: Methodology Blogs

Experimenting With R – Point to Point Mapping With Great Circles

Mon, 2014-03-24 07:17

(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

I’ve started doodling again… This time, around maps, looking for recipes that make life easier plotting lines to connect points on maps. The most attractive maps seem to use great circles to connect one point with another, these providing the shortest path between two points when you consider the Earth as a sphere.

Here’s one quick experiment (based on the Flowing Data blog post How to map connections with great circles), for an R/Shiny app that allows you to upload a CSV file containing a couple of location columns (at least) and an optional “amount” column, and it’ll then draw lines between the points on each row.

The app requires us to solve several problems, including:

  • how to geocode the locations
  • how to plot the lines as great circles
  • how to upload the CSV file
  • how to select the from and to columns from the CSV file
  • how to optionally select a valid numerical column for setting line thickness
Let's start with the geocoder. For convenience, I'm going to use the Google geocoder via the geocode() function from the ggmap library.

#Locations are in two columns, *fr* and *to*, in the *dummy* dataframe
#If locations are duplicated in from/to columns, dedupe so we don't geocode the same location more than once
locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
#Run the geocoder against each location, then transpose and bind the results into a dataframe
cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F)))

The locs data frame contains a single place column of locations:

                    place
1              London, UK
2            Cambridge,UK
3            Paris,France
4       Sydney, Australia
5           Paris, France
6             New York,US
7 Cape Town, South Africa

      The sapply(locs$place,geocode, USE.NAMES=F) function returns data that looks like:

          [,1]     [,2]     [,3]      [,4]     [,5]      [,6]     [,7]
lon -0.1254872 0.121817 2.352222   151.207 2.352222 -74.00597 18.42406
lat   51.50852 52.20534 48.85661 -33.86749 48.85661  40.71435 -33.92487

The transpose (t()) gives us:

            lon       lat
[1,] -0.1254872  51.50852
[2,]   0.121817  52.20534
[3,]   2.352222  48.85661
[4,]    151.207 -33.86749
[5,]   2.352222  48.85661
[6,]  -74.00597  40.71435
[7,]   18.42406 -33.92487

      The cbind() binds each location with its lat and lon value:

                    place        lon       lat
1              London, UK -0.1254872  51.50852
2            Cambridge,UK   0.121817  52.20534
3            Paris,France   2.352222  48.85661
4       Sydney, Australia    151.207 -33.86749
5           Paris, France   2.352222  48.85661
6             New York,US  -74.00597  40.71435
7 Cape Town, South Africa   18.42406 -33.92487

      Code that provides a minimal example for uploading the data from a CSV file on the desktop to the Shiny app, then creating dynamic drop lists containing column names, can be found here: Simple file geocoder (R/shiny app).

      The following snippet may be generally useful for getting a list of column names from a data frame that correspond to numerical columns:

#Get a list of column names for numerical columns in data frame df
nums <- sapply(df, is.numeric)
names(nums[nums])

      The code for the full application can be found as a runnable gist in RStudio from here: R/Shiny app – great circle mapping. [In RStudio, install.packages("shiny"); library(shiny); runGist(9690079). The gist contains a dummy data file if you want to download it to try it out...]

      Here’s the code explicitly…

      The global.R file loads the necessary packages, installing them if they are missing:

#global.R
##This should detect and install missing packages before loading them - hopefully!
list.of.packages <- c("shiny", "ggmap","maps","geosphere")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages,function(x){library(x,character.only=TRUE)})

      The ui.R file builds the Shiny app’s user interface. The drop down column selector lists are populated dynamically with the names of the columns in the data file once it is uploaded. An optional Amount column can be selected – the corresponding list only displays the names of numerical columns. (The lists of location columns to be geocoded should really be limited to non-numerical columns.) The action button prevents the geocoding routines firing until the user is ready – select the columns appropriately before geocoding (error messages are not handled very nicely;-)

#ui.R
shinyUI(pageWithSidebar(
  headerPanel("Great Circle Map demo"),
  sidebarPanel(
    #Provide a dialogue to upload a file
    fileInput('datafile', 'Choose CSV file',
              accept=c('text/csv', 'text/comma-separated-values,text/plain')),
    #Define some dynamic UI elements - these will be lists containing file column names
    uiOutput("fromCol"),
    uiOutput("toCol"),
    #Do we want to make use of an amount column to tweak line properties?
    uiOutput("amountflag"),
    #If we do, we need more options...
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("amountCol")
    ),
    conditionalPanel(
      condition="input.amountflag==true",
      uiOutput("lineSelector")
    ),
    #We don't want the geocoder firing until we're ready...
    actionButton("getgeo", "Get geodata")
  ),
  mainPanel(
    tableOutput("filetable"),
    tableOutput("geotable"),
    plotOutput("geoplot")
  )
))

The server.R file contains the server logic for the app. One thing to note is the way we isolate some of the variables in the geocoder reactive function. (Reactive functions fire when one of the external variables they contain changes. To prevent the function firing when a particular variable it contains changes, we need to isolate that variable. See the docs for more; for example, Shiny Lesson 7: Reactive outputs or Isolation: avoiding dependency.)
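Before the full server.R listing, here is a minimal sketch of isolate() on its own (my example, not part of the app; it uses the single-file shinyApp() style rather than the ui.R/server.R layout used below):

library(shiny)
ui <- fluidPage(
  numericInput("n", "n:", 5),
  actionButton("go", "Go"),
  textOutput("result")
)
server <- function(input, output) {
  output$result <- renderText({
    input$go                            # re-run when the button is pressed...
    isolate(paste("n was", input$n))    # ...but not when input$n changes
  })
}
# shinyApp(ui, server)   # uncomment to run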

#server.R
shinyServer(function(input, output) {

  #Handle the file upload
  filedata <- reactive({
    infile <- input$datafile
    if (is.null(infile)) {
      # User has not uploaded a file yet
      return(NULL)
    }
    read.csv(infile$datapath)
  })

  #Populate the list boxes in the UI with column names from the uploaded file
  output$toCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)
    items=names(df)
    names(items)=items
    selectInput("to", "To:",items)
  })

  output$fromCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)
    items=names(df)
    names(items)=items
    selectInput("from", "From:",items)
  })

  #If we want to make use of an amount column, we need to be able to say so...
  output$amountflag <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)
    checkboxInput("amountflag", "Use values?", FALSE)
  })

  output$amountCol <- renderUI({
    df <- filedata()
    if (is.null(df)) return(NULL)
    #Let's only show numeric columns
    nums <- sapply(df, is.numeric)
    items=names(nums[nums])
    names(items)=items
    selectInput("amount", "Amount:",items)
  })

  #Allow different line styles to be selected
  output$lineSelector <- renderUI({
    radioButtons("lineselector", "Line type:",
                 c("Uniform" = "uniform",
                   "Thickness proportional" = "thickprop",
                   "Colour proportional" = "colprop"))
  })

  #Display the data table - handy for debugging; if the file is large, need to limit the data displayed [TO DO]
  output$filetable <- renderTable({
    filedata()
  })

  #The geocoding bit... Isolate variables so we don't keep firing this...
  geodata <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (is.null(df)) return(NULL)
    isolate({
      dummy=filedata()
      fr=input$from
      to=input$to
      locs=data.frame(place=unique(c(as.vector(dummy[[fr]]),as.vector(dummy[[to]]))),stringsAsFactors=F)
      cbind(locs, t(sapply(locs$place,geocode, USE.NAMES=F)))
    })
  })

  #Weave the geocoded data into the data frame we made from the CSV file
  geodata2 <- reactive({
    if (input$getgeo == 0) return(NULL)
    df=filedata()
    if (input$amountflag != 0) {
      maxval=max(df[input$amount],na.rm=T)
      minval=min(df[input$amount],na.rm=T)
      df$b8g43bds=10*df[input$amount]/maxval
    }
    gf=geodata()
    df=merge(df,gf,by.x=input$from,by.y='place')
    merge(df,gf,by.x=input$to,by.y='place')
  })

  #Preview the geocoded data
  output$geotable <- renderTable({
    if (input$getgeo == 0) return(NULL)
    geodata2()
  })

  #Plot the data on a map...
  output$geoplot<- renderPlot({
    if (input$getgeo == 0) return(map("world"))
    #Method pinched from: http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/
    map("world")
    df=geodata2()
    pal <- colorRampPalette(c("blue", "red"))
    colors <- pal(100)
    for (j in 1:nrow(df)){
      inter <- gcIntermediate(c(df[j,]$lon.x[[1]], df[j,]$lat.x[[1]]),
                              c(df[j,]$lon.y[[1]], df[j,]$lat.y[[1]]),
                              n=100, addStartEnd=TRUE)
      #We could possibly do more styling based on user preferences?
      if (input$amountflag == 0) lines(inter, col="red", lwd=0.8)
      else {
        if (input$lineselector == 'colprop') {
          maxval <- max(df$b8g43bds)
          minval= min(df$b8g43bds)
          colindex <- round( (df[j,]$b8g43bds[[1]]/10) * length(colors) )
          lines(inter, col=colors[colindex], lwd=0.8)
        } else if (input$lineselector == 'thickprop') {
          lines(inter, col="red", lwd=df[j,]$b8g43bds[[1]])
        } else lines(inter, col="red", lwd=0.8)
      }
    }
  })
})

      So that’s the start of it… this app could be further developed in several ways, for example allowing the user to filter or colour displayed lines according to factor values in a further column (commodity type, for example), or produce a lattice of maps based on facet values in a column.

I also need to figure out how to save maps, and maybe produce zoomable ones. If the geocoded points all lie within a bounding box limited to a particular geographical area, scaling the map view to show just that area might be useful.

Other techniques might include using proportional symbols (circles) at line landing points to show the sum of values incoming to that point, or the sum of values outgoing, or the difference between the two (maybe use green where incoming exceeds outgoing, then size by the absolute difference?).


To leave a comment for the author, please follow the link and comment on his blog: OUseful.Info, the blog... » Rstats.

Categories: Methodology Blogs

Estimating Variance as a Function of Treatment Rank Class

Mon, 2014-03-24 03:34

(This article was first published on Econometrics by Simulation, and kindly contributed to R-bloggers)

Imagine that we have a treatment that we give to five different groups of individuals. Each individual has a variable response which has a unique mean and variance based on the treatment. We do not know how the means will change, but we believe the variance of responses will expand depending upon what level of treatment the individual gets. We would like to explicitly model both the differences in means and the differences in variances.

This code was formulated in response to a question posted on CrossValidated.

We want to solve
$$ \max_{\hat\beta,\hat\gamma}\ \sum_{i=1}^{N} \ln\Big( D\big(x_i,\ \hat\beta_{r_i},\ \hat\gamma_0+\hat\gamma_1 r_i\big)\Big) $$

where $D(x,\mu,\sigma)$ is the normal density with mean $\mu$ and standard deviation $\sigma$, and $r_i$ is the treatment rank of individual $i$. In the code below, the group means $\hat\beta$ are the first five elements of the parameter vector and $(\hat\gamma_0,\hat\gamma_1)$ are elements six and seven.

# Specify how many individuals are in each of our groups
nobs.group <- 500

# Simulate our data
grp1 <- data.frame(values=rnorm(nobs.group,5,1), grp=1)
grp2 <- data.frame(values=rnorm(nobs.group,3,2), grp=2)
grp3 <- data.frame(values=rnorm(nobs.group,6,3), grp=3)
grp4 <- data.frame(values=rnorm(nobs.group,5,4), grp=4)
grp5 <- data.frame(values=rnorm(nobs.group,1,5), grp=5)

# Group our data into a single object
mydata <- rbind(grp1,grp2,grp3,grp4,grp5)

# Specify the function to maximize (minimize)
lnp <- function(gamma, x, rank)
  # I include a negative here because the default option with optim is to minimize
  -sum(log(dnorm(x,gamma[1]*(rank==1)+
                   gamma[2]*(rank==2)+
                   gamma[3]*(rank==3)+
                   gamma[4]*(rank==4)+
                   gamma[5]*(rank==5),
                 gamma[6]+gamma[7]*rank)))

ans <- optim(c(
  # Specify initial values for parameters to be estimated
  beta1=1,beta2=1,beta3=1,beta4=1,beta5=1,
  gamma1=1,gamma2=1),
  # Specify the function to minimize (maximize)
  lnp,
  # Input the dependent variable as x and the explanatory variable as rank
  x=mydata$values, rank=mydata$grp,
  # Be sure to include the hessian in the return for
  # calculating standard errors
  hessian=T)

# The standard errors can be estimated using the hessian
stand.error <- sqrt(diag(solve(ans$hessian)))

# This will create a nice table of results
cbind(par.est=ans$par,
      stand.error,
      tstat=ans$par/stand.error,
      pvalue=1-pt(ans$par/stand.error, nrow(mydata)-length(ans$par)),
      CILower=ans$par+stand.error*qt(.05,nrow(mydata)-length(ans$par)),
      CIUpper=ans$par+stand.error*qt(.95,nrow(mydata)-length(ans$par))
)

          par.est stand.error      tstat    pvalue    CILower     CIUpper
beta1   4.9894067  0.04038367 123.550112 0.0000000  4.9229567  5.05585658
beta2   2.0009055  0.10955198  18.264440 0.0000000  1.8206415  2.18116942
beta3   3.7640531  0.19407912  19.394427 0.0000000  3.4447027  4.08340355
beta4   6.4818562  0.23879420  27.144111 0.0000000  6.0889287  6.87478375
beta5  -0.5547626  0.29730735  -1.865957 0.9689176 -1.0439715 -0.06555377
gamma1 -0.4849449  0.05684407  -8.531142 1.0000000 -0.5784798 -0.39140993
gamma2  1.3867000  0.04520519  30.675682 0.0000000  1.3123164  1.46108352

To leave a comment for the author, please follow the link and comment on his blog: Econometrics by Simulation.

Categories: Methodology Blogs

Reminder: Abstract submission for the 2014 ‘R in Insurance’ conference will close this Friday

Mon, 2014-03-24 02:49

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Don't forget, this is the final week you can submit an abstract for the second R in Insurance conference.
For more details see http://www.rininsurance.com and perhaps for inspiration review last year's programme.

To leave a comment for the author, please follow the link and comment on his blog: mages' blog.

Categories: Methodology Blogs

MCMC on zero measure sets

Sun, 2014-03-23 19:14

(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

Simulating a bivariate normal under the constraint (or conditional to the fact) that x²-y²=1 (a non-linear zero measure curve in the 2-dimensional Euclidean space) is not that easy: if running a random walk along that curve (by running a random walk on y and deducing x as x²=y²+1 and accepting with a Metropolis-Hastings ratio based on the bivariate normal density), the outcome differs from the target predicted by a change of variable and the proper derivation of the conditional. The above graph resulting from the R code below illustrates the discrepancy!

targ=function(y){ exp(-y^2)/(1.52*sqrt(1+y^2))}
T=10^5
Eps=3
ys=xs=rep(runif(1),T)
xs[1]=sqrt(1+ys[1]^2)
for (t in 2:T){
  propy=runif(1,-Eps,Eps)+ys[t-1]
  propx=sqrt(1+propy^2)
  ace=(runif(1)<(dnorm(propy)*dnorm(propx))/
    (dnorm(ys[t-1])*dnorm(xs[t-1])))
  if (ace){
    ys[t]=propy;xs[t]=propx
  }else{
    ys[t]=ys[t-1];xs[t]=xs[t-1]}}

If instead we add the proper Jacobian as in

ace=(runif(1)<(dnorm(propy)*dnorm(propx)/propx)/ (dnorm(ys[t-1])*dnorm(xs[t-1])/xs[t-1]))

the fit is there. My open question is how to make this derivation generic, i.e. without requiring the (dreaded) computation of the (dreadful) Jacobian.
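For what it is worth, one quick visual check of the corrected sampler (my addition, not from the original post) is to compare the sampled y values against the analytic conditional density targ() defined above:

# histogram of the sampled y's against the target density
hist(ys, breaks = 100, freq = FALSE, xlab = "y", main = "y given x^2 - y^2 = 1")
curve(targ(x), add = TRUE, col = "red", lwd = 2)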


Filed under: R, Statistics Tagged: conditional density, Hastings-Metropolis sampler, Jacobian, MCMC, measure theory, measure zero set, projected measure, random walk

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

Categories: Methodology Blogs

Warning: Clusters May Appear More Separated in Textbooks than in Practice

Sun, 2014-03-23 18:07

(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

Clustering is the search for discontinuity achieved by sorting all the similar entities into the same piles and thus maximizing the separation between different piles. The latent class assumption makes the process explicit. What is the source of variation among the objects? An unseen categorical variable is responsible. Heterogeneity arises because entities come in different types. We seem to prefer mutually exclusive types (either A or B), but will settle for probabilities of cluster membership when forced by the data (a little bit A but more B-like). Actually, we are more likely to acknowledge that our clusters overlap early on and then forget because it is so easy to see type as the root cause of all variation.

I am asking the reader to recognize that statistical analysis and its interpretation extend over time. If there is variability in our data, a cluster analysis will yield partitions. Given a partitioning, a data analyst will magnify those differences by focusing on contrastive comparisons and assigning evocative names. Once we have names, especially if those names have high imagery, can we be blamed for the reification of minor distinctions? How can one resist segments from Nielsen PRIZM with names like "Shotguns and Pickups" and "Upper Crust"? Yet, are "Big City Blues" and "Low-Rise Living" really separate clusters or simply variations on a common set of dwelling constraints?

Taking our lessons seriously, we expect to see the well-separated clusters displayed in textbooks and papers. However, our expectations may be better formed than our clusters. We find heterogeneity, but those differences are not clumping or distinct concentrations. Our data clouds can be parceled into regions, although those parcels run into one another and are not separated by gaps. So we name the regions and pretend that we have assigned names to types or kinds of different entities with properties that control behavior over space and time. That is, we have constructed an ontology specifying categories to which we have given real explanatory powers.

Consider the following scatterplot from the introductory vignette in the R package mclust. You can find all the R code needed to produce these figures at the end of this post.

This is the Old Faithful geyser data from the "datasets" R package, showing the waiting time in minutes between successive eruptions on the y-axis and the duration of the eruptions along the x-axis. It is worth your time to get familiar with Old Faithful because it is one of those datasets that gets analyzed over and over again using many different programs. There seem to be two concentrations of points: shorter eruptions that occur more quickly and longer eruptions that have a longer waiting period. If we told the Mclust function from the mclust package that the scatterplot contains observations from G=2 groups, the function would produce a classification plot that looked something like this:
The red and the blue with their respective ellipses are the two normal densities that are getting mixed. It is such a straightforward example of finite mixture or latent class models (as these models are also called by analysts in other fields of research). If we discovered that there were two pools feeding the geyser, we could write a compelling narrative tying all the data together.
The mclust vignette or manual is comprehensive but not overly difficult. If you prefer a lecture, there is no better introduction to finite mixture than the MathematicalMonk YouTube video. The key to understanding finite mixture models is recognizing that the underlying latent variable responsible for the observed data is categorical, a latent class which we do not observe, but which explains the location and shape of the data points. Do you have a cold or the flu? Without medical tests, all we can observe are your symptoms. If we filled a room with coughing, sneezing, achy and feverish people, we would find a mixture of cold and flu with differing numbers of each type.
This appears straightforward, but for the general case, how does one decide how many categories, in what proportions, and with what means and covariance matrices? That is, those two ellipses in the above figure are drawn using a vector with means for the x- and y-axes plus a 2x2 covariance matrix. The means move the ellipse over the space, and the covariance matrix changes the shape and orientation of the ellipse. A good summary of the possible forms is given in Table 1 of the mclust vignette.
Unfortunately, the mathematics of the EM algorithm used to solve this problem gets complicated quickly. Fortunately, Chris Bishop provides an intuitive introduction in a 2004 video lecture. Starting at 44:14 of Part 4, you will find a step-by-step description of how the EM algorithm works with the Old Faithful data. Moreover, in Chapter 9 of his book, Bishop cleverly compares the workings of the EM and k-means algorithms, leaving us with a better understanding of both techniques.
If only my data showed such clear discontinuity and could be tied to such a convincing narrative.
Product Categories Are Structured around the Good, the Better, and the Best
Almost every product category offers a range of alternatives that can be described as good, better, and best. Such offerings reflect the trade-offs that customers are willing to make between the quality of the products they want and the amount that they are ready to spend. High-end customers demand the best and will pay for it. On the other end, one finds customers with fewer needs and smaller budgets accepting less and paying less. Clearly, we have heterogeneity, but are those differences continuous or discrete? Can we tell by looking at the data?
Unlike the Old Faithful data with well-separated grouping of data points, product quality-price trade-offs look more like the following plot of 300 consumers indicating how much they are willing to spend and what they expect to get for their money (i.e., product quality is a composite index combining desired features and services).
There is a strong positive relationship between demand for quality and willingness to pay, so a product manager might well decide that there was opportunity for at least a high-end and a low-end option. However, there are no natural breaks in the scatterplot. Thus, if this data cloud is a mixture of distinct distributions, then these distributions must be overlapping.
Another example might help. As shown by John Cook, the distribution of heights among adults is a mixture of two overlapping normal distributions, one for men and another for women. Yet, as you can observe from Cook's plots, the mixture of men's and women's heights does not appear bimodal because the separation between the two distributions is not large enough. If you follow the links in Cook's post, eventually you will find the paper "Is Human Height Bimodal?", which clearly demonstrates that many mixtures of distributions appear to be homogeneous. We simply cannot tell that they are mixtures by looking just at the shape of the distribution for the combined data. The Old Faithful data with its well-separated bimodal curve provides a nice contrast, especially when we focus only on waiting time as a single dimension (Fitting Mixture Models with the R Package mixtools).
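As a quick illustration of that point (my own sketch, not taken from Cook's post or the paper; the means and standard deviations are illustrative), a 50/50 mixture of two normals whose means differ by less than roughly two standard deviations yields a density with a single visible mode:

set.seed(1)
heights <- c(rnorm(5000, mean = 164, sd = 7),   # illustrative "women" component
             rnorm(5000, mean = 176, sd = 7))   # illustrative "men" component
plot(density(heights), main = "Mixture of two normals, no visible bimodality")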
Perhaps then, segmentation does not require gaps in the data cloud. As Wedel and Kamakura note, "Segments are not homogeneous groupings of customers naturally occurring in the marketplace, but are determined by the marketing manager's strategic view of the market." That is, one could look at the above scatterplot, see the presence of a strong first principal component from the closeness of the data points to the principal axis of the ellipse and argue that customer heterogeneity is a single continuous dimension running from the low- to the high-end of the product spectrum. Or, one could look at the same scatterplot and see three overlapping segments seeking basic, value and premium products (good, better, best). Let's run mclust and learn if we can find our three segments.

When instructed to look for three clusters, the Mclust function returned the above result. The 300 observed points are represented as a mixture of three distributions falling along the principal axis of the larger ellipse formed by all the observations. However, if I had not specified three clusters and asked Mclust to use its default BIC criterion to select the number of segments, I would have been told that there was no compelling evidence for more than one homogeneous group. Without any prior specification Mclust would have returned a single homogeneous distribution, although as you can see from the R code below, my 300 observations were a mixture of three equal size distributions falling along the principal axis and separated by one standard deviation.
Number of Segments = Number Yielding Value to Product Management
Market segmentation lies somewhere between mass marketing and individual customization. When mass marketing fails because customers have different preferences or needs and customization is too costly or difficult, the compromise is segmentation. We do not need "natural" grouping, but just enough coherence for customers to be satisfied by the same offering. Feet come in many shapes and sizes. The shoe manufacturer can get along with three sizes of sandals but not three sizes of dress shoes. It is not the foot that is changing, but the demands of the customer. Thus, even if segments are no more than convenient fictions, they can be useful from the manager's perspective.
My warning still holds. Reification can be dangerous. These segments are meaningful only within the context of the marketing problem created by trying to satisfy everyone with products and services that yield maximum profit. Some segmentations may return clusters that are well-separated and represent groups with qualitatively different needs and purchase processes. Many of these are obvious and define different markets. If you don't have a dog, you don't buy dog food. However, when segmentation seeks to identify those who feel that their dog is a member of the family, we will find overlapping clusters that we treat differently not because we have revealed the true underlying typology, but because it is in our interest. Don't be fooled into believing that our creative segment names reveal the true workings of the consumer mind.
Finally, what is true for marketing segmentation is true for all of cluster analysis. "Clustering: Science or Art?" (a 2009 NIPS workshop) raises many of these same issues for cluster analysis in general. Videos of this workshop are available at Videolectures. Unlike supervised learning with its clear criterion for success and failure, clustering depends on users of the findings to tell us if the solution is good or bad, helpful or not. On the one hand, this seems to make everything more difficult. On the other hand, it frees us to be more open to alternative methods for describing heterogeneity as it is now and how it evolves over time.  
We seek to understand the dynamic structure of diversity, which only sometimes takes the form of cohesive clusters separated by gaps. Other times, a model with only continuous latent variables seems to be the best choice (e.g., brand perceptions). And, not unexpectedly, there are situations where heterogeneity cannot be explained without both categorical and continuous latent variables (e.g., two or more segments seeking alternative benefit profiles with varying intensities).
Yet, even these three combinations cannot adequately account for all the forms of diversity we find in consumer data.  Innovation might generate a structure appearing more like long arrays or streams of points seemingly pulled toward the periphery by an archetypal ideal or aspirational goal. And if the coordinate space of k-means and mixture models becomes too limiting, we can replace it with pairwise dissimilarity and graphical clustering techniques, such as affinity propagation or spectral clustering. Nor should we be wedded to the stability of our segment solution when those segments were created by dynamic forces that continue to act and alter its structure. Our models ought to be as diverse as the objects we are studying.
R code for all figures and analysis
#attach faithful data set
data(faithful)
plot(faithful, pch="+")
 
#run mclust on faithful data
require(mclust)
faithfulMclust<-Mclust(faithful, G=2)
summary(faithfulMclust, parameters=TRUE)
plot(faithfulMclust)
 
#create 3 segment data set
require(MASS)
sigma <- matrix(c(1.0,.6,.6,1.0),2,2)
mean1<-c(-1,-1)
mean2<-c(0,0)
mean3<-c(1,1)
set.seed(3202014)
mydata1<-mvrnorm(n=100, mean1, sigma)
mydata2<-mvrnorm(n=100, mean2, sigma)
mydata3<-mvrnorm(n=100, mean3, sigma)
mydata<-rbind(mydata1,mydata2,mydata3)
colnames(mydata)<-c("Desired Level of Quality",
"Willingness to Pay")
plot(mydata, pch="+")
 
#run Mclust with 3 segments
mydataClust<-Mclust(mydata, G=3)
summary(mydataClust, parameters=TRUE)
plot(mydataClust)
 
#let Mclust decide on number of segments
mydataClust<-Mclust(mydata)
summary(mydataClust, parameters=TRUE)

To leave a comment for the author, please follow the link and comment on his blog: Engaging Market Research.

Categories: Methodology Blogs

Random Love

Sun, 2014-03-23 17:53

(This article was first published on Ripples, and kindly contributed to R-bloggers)

Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin (John von Neumann)

Ms. Positive and Mr. Negative live in a one-dimensional world and are falling in love. But beginnings are not always easy. They have a big problem: neither of them likes the other's neighborhood. Ms. Positive only wants to walk around the Positive Integer Numbers Neighborhood and Mr. Negative around the Negative Integer Numbers one. This is a prickly problem they need to deal with as soon as possible. But they have a good idea. They will start their walks from Zero, an impartial place between both neighborhoods, and will let fate guide their feet. They will toss a coin to decide every step: if the result is heads, they will advance 1 step toward the positive numbers neighborhood; if tails, they will advance 1 step toward the negative numbers one. For example, if the first 5 tosses are heads, heads, tails, tails and tails, their first 5 positions will be +1, +2, +1, 0 and -1. It seems a fair agreement for both. Maybe it is not the most pleasant way to take a walk, but it is well known that lovers constantly do silly things, especially at the beginning. They always walk for two hours, so they toss the coin 7,200 times every walk (these lovers are absolutely crazy, as you can see). This was their first walk:

After this first walk, Mr. Negative was really upset. Ms. Positive, watching his face fall, asked him: What's the matter, honey? And Mr. Negative replied: What's the matter? What's the matter? The matter is that we spent almost all the time walking around your horrible neighborhood! What comes next is too hard to be reproduced here. Anyway, they agreed to give the method they designed a chance. How can one imagine that a coin can produce such a strange walk! There must be an error! After 90 walks, the situation of our lovers was extremely delicate. In 57% of the walks, more than 80% of the steps were spent around the same neighborhood, which made them absolutely awful for one of the two. Another 32% were a bit uncomfortable for one of them, since between 60% and 80% of the steps were around the same neighborhood. Only 11% of the walks were gratifying. How is it possible?, said Mr. Negative. How is it possible?, said Ms. Positive.

But here comes Ms. Positive, who always looks on the bright side of life: Don't worry, darling. In fact, we don't have to be sad: we get angry the same number of times! That is enough for me. What about you?, she said. It is perfect for me as well!, said Mr. Negative. In that moment, they realised they were made for each other and started another random walk with a big smile on their faces.
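Before the full plotting code, here is a minimal sketch of a single two-hour walk (7,200 fair coin tosses) reduced to its essentials; this is my own stripped-down version of the simulation below:

set.seed(42)
steps <- cumsum(sample(c(-1, 1), 7200, replace = TRUE))  # +1 for heads, -1 for tails, accumulated
plot(steps, type = "l", xlab = "Toss", ylab = "Location")
abline(h = 0, lty = 2)   # the border between the two neighborhoods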

This is the code:

library(ggplot2)

steps   <- 2*60*60   #Number of steps
results <- data.frame()
walks   <- 90        #Number of walks

for (i in 1:walks) {
  state <- cumsum(sample(c(-1,1), steps, replace = TRUE))
  results <- rbind(results, c(sum(state<0), sum(state>0), sum(state==0),
                              if (sum(state<0) >= sum(state>0)) 1 else 0))
}
colnames(results) <- c("neg.steps", "pos.steps", "zero.steps", "ind.neg")
results$max.steps <- apply(results, 1, max)/apply(results, 1, sum)

#Plot of one of these walks
mfar=max(abs(max(state)),abs(min(state)))
plot1 <- qplot(seq_along(state), state, geom="path")+
  xlab("Step") +
  ylab("Location") +
  labs(title = "The First Walk Of Ms. Positive And Mr. Negative")+
  theme(plot.title = element_text(size = 35))+
  theme(axis.title.y = element_text(size = 20))+
  theme(axis.title.x = element_text(size = 20))+
  scale_x_continuous(limits=c(0, length(state)),breaks=c(1,steps/4,steps/2,3*steps/4,steps))+
  scale_y_continuous(limits=c(-mfar, mfar), breaks=c(-mfar,-mfar/2, 0, mfar/2,mfar))+
  geom_hline(yintercept=0)
ggsave(plot1, file="plot1.png", width = 12, height = 10)

#Summary of all walks
hist1 <- ggplot(results, aes(x = max.steps))+
  geom_histogram(colour = "white",breaks=seq(.4,1,by=.2),fill=c("blue", "orange", "red"))+
  theme_bw()+
  labs(title = paste("What Happened After ", toString(walks), " Walks?",sep = ""))+
  scale_y_continuous(breaks=seq(0,(nrow(results[results$max.steps>.8,])+10),by=10))+
  theme(plot.title = element_text(size = 40))+
  xlab("Maximum Steps In The Same Location (%)") +
  ylab("Number of Walks")
ggsave(hist1, file="hist1.png", width = 10, height = 8)

#Data for waterfall chart
waterfall <- as.data.frame(cbind(
  c("Total Walks", "Satisfactory Walks", "Uncomfortable Walks", "Awful Walks for Mr. +", "Awful Walks for Ms. -"),
  c("a", "b", "c", "d", "d"),
  c(0,
    nrow(results),
    nrow(results)-nrow(results[results$max.steps<.6,]),
    nrow(results)-nrow(results[results$max.steps<.6,])-nrow(results[results$max.steps>=.6 & results$max.steps<.8,]),
    nrow(results)-nrow(results[results$max.steps<.6,])-nrow(results[results$max.steps>=.6 & results$max.steps<.8,])-nrow(results[results$max.steps>=.8 & results$ind.neg==1,])
  ),
  c(nrow(results),
    nrow(results)-nrow(results[results$max.steps<.6,]),
    nrow(results)-nrow(results[results$max.steps<.6,])-nrow(results[results$max.steps>=.6 & results$max.steps<.8,]),
    nrow(results)-nrow(results[results$max.steps<.6,])-nrow(results[results$max.steps>=.6 & results$max.steps<.8,])-nrow(results[results$max.steps>=.8 & results$ind.neg==1,]),
    0
  ),
  c(nrow(results),
    nrow(results[results$max.steps<.6,]),
    nrow(results[results$max.steps>=.6 & results$max.steps<.8,]),
    nrow(results[results$max.steps>=.8 & results$ind.neg==1,]),
    nrow(results[results$max.steps>=.8 & results$ind.neg==0,]))
))
colnames(waterfall) <- c("desc", "type", "start", "end", "amount")
waterfall$id <- seq_along(waterfall$amount)
waterfall$desc <- factor(waterfall$desc, levels = waterfall$desc)

#Waterfall chart
water1 <- ggplot(waterfall, aes(desc, fill = type)) +
  geom_rect(aes(x = desc, xmin = id-0.45, xmax = id+0.45, ymin = end, ymax = start))+
  xlab("Kind of Walk") +
  ylab("Number of Walks") +
  labs(title = "The Ultimate Proof (After 90 Walks)")+
  theme(plot.title = element_text(size = 35))+
  theme(axis.title.y = element_text(size = 20))+
  theme(axis.title.x = element_text(size = 20))+
  theme(legend.position = "none")

To leave a comment for the author, please follow the link and comment on his blog: Ripples.

Categories: Methodology Blogs