R bloggers

R news and tutorials contributed by (563) R bloggers

Multiple Tests, an Introduction

Wed, 2014-09-24 16:12

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

Last week, a student asked me about multiple tests. More precisely, she ran an experiment over – say – 20 weeks, with the same cohort of – say – 100 patients. And we observe some measurements X_{i,t}, for patient i in week t,

size=100
nb=20
set.seed(1)
X=matrix(rnorm(size*nb),size,nb)

(here, I just generate some fake data). I can visualize some trajectories, over the 20 weeks,

library(RColorBrewer)
cl1=brewer.pal(12,"Set3")[12]
cl2=brewer.pal(8,"Set2")[2:7]
cl=c(cl1,cl2)
boxplot(X)
for(i in 1:length(cl)){
  lines(1:20,X[i,],type="b",col=cl[i])
}

 

She wanted to compare the averages, over the experiments.

boxplot(X)
for(i in 1:length(cl)){
  lines(1:20,X[i,],type="b",col=cl[i])
}

The question is what do you want to test? Is it

H0: μ1 = μ2 = ⋯ = μ20 (all weekly means are equal)

versus

H1: μs ≠ μt for at least one pair (s,t)?

Because that one might not be very informative (if we reject H0, we don't really know why). Here, I did consider completely independent observations, over individuals, and over time. Nevertheless, if we instead perform the 190 pairwise tests of H0: μs = μt, we should not be surprised – even if H0 is valid – that we reject it many times!

> P=NULL
> for(i in 1:(nb-1)){ for(j in (i+1):nb){
+ P=c(P,t.test(X[,i],X[,j])$p.value)
+ }}
> mean(P>.05)
[1] 0.9368421

i.e. for 12 of the 190 pairs, we reject the assumption of identical means – even though that assumption is true here. That's the problem with multiple tests.

Actually, observe that over those 190 tests, the p-values are uniformly distributed over the unit interval. Thus, for (almost) 95% of the tests, the p-value exceeds 5%.
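A quick way to check that on the simulated p-values (a minimal sketch, reusing the vector P computed above; the 190 tests share data, so this is only an informal check):

hist(P, breaks=seq(0,1,by=.05), col="lightblue", main="", xlab="p-value")   # should look roughly flat
ks.test(P, "punif")   # Kolmogorov-Smirnov test against the uniform distribution
mean(P>.05)           # close to 95%
mean(P>.10)           # close to 90%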

Now, what if we relax the assumption of independence? Here, we need to be more specific. A first idea might be to assume that

X_{i,t} = β·t + ε_{i,t}

for some trend β. We can actually assume that the trend depends on the individuals, and write, for instance,

X_{i,t} = β_i·t + ε_{i,t}

Standard regression techniques can be considered, under those models. We’ll probably discuss that point before the end of this year.
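As a minimal sketch of what such a regression could look like (assuming the matrix X generated above, reshaped to long format; the two fits below are only illustrative, not the models we will eventually use):

# one row per (patient, week)
df <- data.frame(y  = as.vector(X),
                 id = factor(rep(1:size, times=nb)),
                 t  = rep(1:nb, each=size))
summary(lm(y ~ t, data=df))        # common linear trend
summary(lm(y ~ id*t, data=df))     # individual-specific intercepts and trends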

My point was the following: if we assume that

X_i = (X_{i,1}, …, X_{i,20})

consists of independent observations, where all the X_{i,t}'s are assumed to be independent, and if we use pairwise comparison tests for the means, then in 95% of the tests the p-value exceeds 5%, and in 90% of the tests the p-value exceeds 10%. That is the interpretation of the property that p-values are uniformly distributed on the unit interval. Now, what if we relax the independence assumption? What if observations were – somehow – correlated?

A first idea could be to consider a linear trend,

X=matrix(rnorm(size*nb),size,nb)
T=t(matrix((1:nb)/nb*r,nb,size))   # linear trend in time (the parameters r and a, which set the slope, are chosen beforehand)
X=X+T*a

Here, we still compute pairwise comparison tests

> P=NULL
> for(i in 1:(nb-1)){ for(j in (i+1):nb){
+ P=c(P,t.test(X[,i],X[,j])$p.value)
+ }}

and we count the proportion of pairs for which the p-value exceeds some given thresholds

PV=rep(NA,6)
PV[1]=mean(P>.025)
PV[2]=mean(P>.05)
PV[3]=mean(P>.075)
PV[4]=mean(P>.1)
PV[5]=mean(P>.125)
PV[6]=mean(P>.15)

If we look at the proportions, as a function of the slope, we have the following graph

In the middle, we have the independent case: for instance, 97.5% of the p-values exceed 2.5% – the upper curve. But as the slope increases (or decreases), the proportion decreases. With a slope of 0.2, only 90% of the p-values exceed 2.5%.
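A sketch of the loop behind that graph might look as follows (assuming, for illustration, a single slope parameter a; the exact code used for the figure is not shown in the post):

slopes=seq(-.2,.2,by=.05)
PV=matrix(NA,length(slopes),6)
for(k in 1:length(slopes)){
  a=slopes[k]
  X=matrix(rnorm(size*nb),size,nb)
  T=t(matrix((1:nb)/nb,nb,size))
  X=X+T*a
  P=NULL
  for(i in 1:(nb-1)){ for(j in (i+1):nb){
    P=c(P,t.test(X[,i],X[,j])$p.value) }}
  PV[k,]=sapply(c(.025,.05,.075,.1,.125,.15),function(u) mean(P>u))
}
matplot(slopes,PV,type="l",lty=1,xlab="slope",ylab="proportion of p-values above threshold")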

One can also assume that

X_i = (X_{i,1}, …, X_{i,20})

is some time series. An autoregressive one, for instance,

X_{i,t} = r·X_{i,t-1} + ε_{i,t}

so that the correlation between X_{i,t} and X_{i,s} is r^|t-s|.

Here, we use the following code to generate that correlation matrix, and then the vector

autoreg=function(n,r){
  # correlation matrix of an AR(1) process: the (s,t) entry is r^|s-t|
  M=matrix(1,n,n)
  for(i in 1:(n-2)){
    diag(M[(i+1):n,1:(n-i)])=r^i
    diag(M[1:(n-i),(i+1):n])=r^i
  }
  M[1,n]=M[n,1]=r^(i+1)
  return(M)
}
S=autoreg(nb,r)
library(mnormt)
X=rmnorm(size,varcov=S)

The graph below shows the evolution of the proportion of p-values exceeding some threshold, as a function of the autoregressive coefficient.

The value on the left is the independent case (where p-values are uniformly distributed). The stronger the autocorrelation, the larger the proportion of p-values exceeding any given value.
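A sketch of the corresponding loop over the autoregressive coefficient (reusing the autoreg function above; again, the exact code behind the figure is not shown):

library(mnormt)
rs=seq(0,.9,by=.1)
PV=matrix(NA,length(rs),6)
for(k in 1:length(rs)){
  X=rmnorm(size,varcov=autoreg(nb,rs[k]))
  P=NULL
  for(i in 1:(nb-1)){ for(j in (i+1):nb){
    P=c(P,t.test(X[,i],X[,j])$p.value) }}
  PV[k,]=sapply(c(.025,.05,.075,.1,.125,.15),function(u) mean(P>u))
}
matplot(rs,PV,type="l",lty=1,xlab="autoregressive coefficient",ylab="proportion of p-values above threshold")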

Finally, consider the case where the components of the vector

X_i = (X_{i,1}, …, X_{i,20})

are exchangeable.

S=matrix(r,nb,nb)
diag(S)=1
library(mnormt)
M=rep(0,nb)   # zero mean vector
X=rmnorm(size,M,varcov=S)

In that case, we have the following graph, with the proportion of p-values exceeding some value, as a function of the correlation r between (any) pair of components X_{i,t} and X_{i,s}.

This is the idea that is used in any correction-type method for multiple tests, for instance the Bonferroni correction: when we run m tests, instead of comparing each p-value with a level α, we compare it with a corrected level α/m. But as we can see here, the appropriate correction will depend on the kind of dependence we assume there might be.
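In R, such corrections are available through p.adjust; a small sketch, applied to a vector P of 190 pairwise p-values as computed above:

sum(P<.05)                                   # raw tests: rejections at the 5% level among the 190 pairs
sum(p.adjust(P,method="bonferroni")<.05)     # Bonferroni: equivalent to comparing each p-value with .05/190
sum(p.adjust(P,method="holm")<.05)           # Holm's step-down correction, slightly less conservative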

Next time, I will discuss my student's data, and see what can be said.

To leave a comment for the author, please follow the link and comment on his blog: Freakonometrics » R-english.


Adding Google Drive Times and Distance Coefficients to Regression Models with ggmap and sp

Wed, 2014-09-24 14:34

(This article was first published on SpatioAnalytics » r, and kindly contributed to R-bloggers)

Space, a wise man once said, is the final frontier.

Not the Buzz Aldrin/Buzz Lightyear, Neil deGrasse Tyson kind (but seriously, have you seen Cosmos?). Geographic space. Distances have been finding their way into metrics since the cavemen (probably). GIS seems to make nearly every science way more fun…and accurate!

Most of my research deals with spatial elements of real estate modeling. Unfortunately, “location, location, location” has become a clichéd way to begin any paper or presentation pertaining to spatial real estate methods. For you geographers, it’s like setting the table with Tobler’s first law of geography: a quick fix (I’m not above that), but you’ll get some eye-rolls. But location is important!

One common method of taking location and space into account in real estate valuation models is by including distance coefficients (e.g. distance to downtown, distance to center of city). The geographers have this straight-line calculation of distance covered,  and R can spit out distances between points in a host of measurement systems (euclidean, great circle, etc.). This straight-line distance coefficient is a helpful tool when you want to help reduce some spatial autocorrelation in a model, but it doesn’t always tell the whole story by itself (please note: the purpose of this post is to focus on the tools of R and introduce elements of spatial consideration into modeling. I’m purposefully avoiding any lengthy discussions on spatial econometrics or other spatial modeling techniques, but if you would like to learn more about the sheer awesomeness that is spatial modeling, as well as the pit-falls/pros and cons of each, check out Luc Anselin and Stewart Fotheringham for starters. I also have papers being publishing this fall and would be more than happy to forward you a copy if you email me. They are:

Bidanset, P. & Lombard, J. (2014). The effect of kernel and bandwidth specification in geographically weighted regression models on the accuracy and uniformity of mass real estate appraisal. Journal of Property Tax Assessment & Administration. 11(3). (copy on file with editor).

and

Bidanset, P. & Lombard, J. (2014). Evaluating spatial model accuracy in mass real estate appraisal: A comparison of geographically weighted regression (GWR) and the spatial lag model (SLM). Cityscape: A Journal of Policy Development and Research. 16(3). (copy on file with editor).).

Straight-line distance coefficients certainly can help account for location, as well as certain distance-based effects on price. Say you are trying to model the negative externalities of a landfill in August: assuming wind is either random or non-existent, straight-line distance from the landfill to house sales could help capture the cost of said stank. Likewise with capturing the potential spill-over effects of an airport – the sound of jets will diminish as distance increases, and the path of sound will be more or less a straight line.

But again, certain distance-based elements cannot be accurately represented with this method. You may expect ‘distance to downtown’ to have an inverse relationship with price: the further out you go, the more of a cost is incurred (in time, gas, and overall inconvenience) getting to work and social activities, so demand for these further-out homes decreases, resulting in cheaper-priced homes (pardon the hasty economics). Using straight-line distances to account for commutes in a model presents some problems (aside: there is nary a form of visualization capable of presenting one’s point more professionally than Paint, and as anyone who has ever had the misfortune of being included in a group email chain with me knows, I am a bit of a Paint artist). If a trip between a person’s work and a person’s home followed a straight line, this would be less of a problem (artwork below).

But we all know commuting is more complicated than this. There could be a host of things between you and your place of employment that would make a straight-line distance coefficient an inept method of quantifying this effect on home values … such as a lake:

… or a Sarlacc pit monster:

Some cutting edge real estate valuation modelers are now including a ‘drive time’ variable. DRIVE TIME! How novel is that? This presents a much more accurate way to account for a home’s distance – as a purchaser would see it – from work, shopping, mini-golf, etc. Sure it’s been available in (expensive) ESRI packages for some time, but where is the soul in that? The altruistic R community has yet again risen to the task.

To put some real-life spin on the example above, let’s run through a very basic regression model for modeling house prices.

sample = read.csv("C:/houses.csv", header=TRUE)
model1 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition, data = sample)

We read in a csv file “houses” that is stored on the C:/ drive and name it “sample”. You can name it anything, even willywonkaschocolatefactory. We’ll name the first model “model1”. The dependent variable, ln.ImpSalePrice., is a log form of the sale price. TLA is ‘total living area’ in square feet. Age is, well, age of the house, and quality and condition are dummy variables. The squared variables of TLA and Age are to capture any diminishing marginal returns.

AIC stands for ‘Akaike information criterion’. Japanese statistician Hirotugu Akaike coined it in the ’70s, and it’s a measure of model quality – balancing goodness of fit against complexity – used to compare models fit to the same sample (the lower the AIC, the better).

> AIC(model1)
[1] 36.35485

The AIC of model1 is 36.35.
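If you want to see where that number comes from, AIC can be recomputed from the log-likelihood as 2k − 2·log L, where k is the number of estimated parameters (a quick sketch, using model1 from above):

ll <- logLik(model1)          # log-likelihood of the fitted model
k <- attr(ll, "df")           # number of estimated parameters (including the error variance)
-2*as.numeric(ll) + 2*k       # matches AIC(model1)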

Now we are going to create some distance variables to add to the model. First we’ll do the straight-line distances. We make a matrix called “origin” consisting of start-points, which in this case are the coordinates of each house in our dataset (note that sp expects longitude in the first column and latitude in the second).

origin <- matrix(c(sample$lon, sample$lat), ncol=2)   # longitude first, then latitude

We next create a destination – the point to which we will be measuring the distance. For this example, I decided to measure the distance to a popular shopping mall downtown (why not?). I obtained the lat/long coordinates for the mall by right clicking on it in Google Maps and clicking “What’s here?” (I also could’ve geocoded it in R).

destination <- c(-76.288018, 36.84895)   # again longitude first, then latitude
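If you'd rather stay in R for that step, ggmap can geocode an address directly (a small sketch; the address string below is a hypothetical placeholder):

library(ggmap)
mall <- geocode("MacArthur Center, Norfolk, VA")   # hypothetical address; returns a data frame with columns lon and lat
c(mall$lon, mall$lat)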

Now we use the spDistsN1 function from the sp package to calculate the distances. We set longlat=TRUE so we get the values from origin to destination in kilometers. The last line just adds this newly created column of distances to our dataset and names it dist.

library(sp)
km <- spDistsN1(origin, destination, longlat=TRUE)
sample$dist <- km

This command I learned from a script on Github – initially committed by Peter Schmiedeskamp – which alerted me to the fact that R was capable of grabbing drive-times from the Google Maps API.  You can learn a great deal from his/their work so give ‘em a follow!

library(ggmap)
library(plyr)
google_results <- rbind.fill(apply(subset(sample, select=c("location", "locMall")), 1,
                                   function(x) mapdist(x[1], x[2], mode="driving")))

location is the column containing each house’s lat/long coordinates, in the following format: (36.841287,-76.218922). locMall is a column in my data set with the lat/long coords of the mall in each row. Just to clarify: each cell in this column had the exact same value, while each cell of “location” was different. Also something amazing: mode can be either “driving”, “walking”, or “bicycling”!

Now let’s look at the results:

> head(google_results, 4)
                    from                      to     m     km    miles seconds   minutes      hours
1 (36.901373,-76.219024) (36.848950, -76.288018) 10954 10.954 6.806816     986 16.433333 0.27388889
2 (36.868871,-76.243859) (36.848950, -76.288018)  7279  7.279 4.523171     662 11.033333 0.18388889
3 (36.859805,-76.296122) (36.848950, -76.288018)  2101  2.101 1.305561     301  5.016667 0.08361111
4 (36.938692,-76.264474) (36.848950, -76.288018) 12844 12.844 7.981262     934 15.566667 0.25944444

Amazing, right? And we can add this to our sample and rename it “newsample”:

newsample <- c(sample, google_results)   # note: c() gives a list of columns here; cbind() would keep a data frame (lm accepts either)

Now let’s add these variables to the model and see what happens.

model2 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + dist,
             data = newsample)
> AIC(model2)
[1] 36.44782

Gah, well, no significant change. Hmm…let’s try the drive-time variable…

model3 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + minutes,
             data = newsample)
> AIC(model3)
[1] 36.10303

Hmm…still no dice. Let’s try them together.

model4 <- lm(ln.ImpSalePrice. ~ TLA + TLA.2 + Age + Age.2 + quality + condition + minutes + dist,
             data = newsample)
> AIC(model4)
[1] 32.97605

Alright! The AIC has dropped by more than 2, which by the usual rule of thumb means that, taken together, the two distance variables meaningfully improve the model.
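As a side note, AIC accepts several fitted models at once, which makes the comparison easier to eyeball (assuming the four models above are still in the workspace):

AIC(model1, model2, model3, model4)   # returns a small table with the degrees of freedom and AIC of each model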

Of course this is a grossly reduced model, and would never be used for actual valuation/appraisal purposes, but it does lay elementary groundwork for creating distance-based variables, integrating them, and demonstrating their ability to marginally improve models.

Thanks for reading. So, to bring back Cap’n Kirk: I think a frontier more ultimate than space, in the modeling sense, is space-time – not Einstein’s, rather ‘spatiotemporal’. That will be for another post!

Toodles,

Paul


To leave a comment for the author, please follow the link and comment on his blog: SpatioAnalytics » r.


Data Science Toolbox Survey Results… Surprise! R and Python win

Wed, 2014-09-24 13:25

(This article was first published on Data Science Los Angeles » R, and kindly contributed to R-bloggers)

This is a re-publication of a blog post from a blog I created not long before we got the idea to start DataScience.LA, and in many respects it is the genesis of DSLA (the blog’s name, the people getting together – thanks Eduardo, Amelia and Leigh). It was also the spark that led to the creation of the Python Data Science LA meetup.

At a previous LA Data Science/Machine Learning meetup we did an informal survey (using a quick show of hands and a very approximate count) of the software tools used for data analysis by those in the audience. With about 200 people in attendance and about 60% saying they are data scientists or perform a similar job, I believe the results below are more representative and less biased than our previous attempts via meetup polls (where relatively few responded) or even more formal surveys others have attempted (e.g. by my friends at KDnuggets or Rexer Analytics). I know, there is a likely bias in our results as well, towards open source tools; maybe our meetups are just too cool for all those SAS users out there (I kid, I kid.)

Considering the various parts of the process of analyzing data, we surveyed the audience for tools used in:

  1. data munging (“explore”, “clean” and “transform” above) – both exploratory data analysis (EDA) and operational ETL,
  2. visualization – both exploratory and presentational,
  3. machine learning/modeling.

We started with tools that I (Szilard) thought must be the most popular, but we also asked what other tools people are using, so we didn’t miss any hidden gems.

When we asked about data munging, we found that about 60% are using R in some part of their data science process, 50% were using Python, 40% SQL, 30% Hadoop (mostly Hive), and 20% the Unix shell. Only 10% acknowledged using Excel. Other tools used by some of our attending data scientists included Perl, Matlab and SAS, as well as Pig, Impala and Shark within Hadoop. Clojure, Scalding and Elasticsearch made an appearance with just one user each.

For visualization, about 40% were using R, 30% Python, 10% Tableau, and 10% were building custom visualizations in JavaScript (although we admit we did not ask which libraries). Just a few people were using Excel or Matlab, and there was a Splunk user out there too.

When it came to machine learning, only 30% of the audience raised their hands saying they used R, with about 30% using Python. There were quite a few tools mentioned with 2-3 hands apiece, such as Vowpal Wabbit, SAS, SPSS, Matlab, Mahout. In the minority were users of Spark MLlib, Graphlab, Shogun, and Weka.

I feel that this quick and dirty poll gives us a starting picture of a typical data scientist’s toolbox across a few of our major tasks. With a more formal survey we could get into further detail within our rough categories, such as breaking down usage for EDA vs. ETL, or separating into groups those who use various R/Python packages. As you can imagine, we’re thinking of doing this in the near future – stay tuned.

I’d like to thank Eduardo Arino de la Rubia for helping with the survey. Feel free to post your personal tools of choice in the comments below (especially if it’s not mentioned above)!

 

To leave a comment for the author, please follow the link and comment on his blog: Data Science Los Angeles » R.
