R bloggers

R news and tutorials contributed by (552) R bloggers
Updated: 10 hours 56 min ago

Introducing H2O Lagrange (2.6.0.11) to R

Mon, 2014-09-01 03:00

(This article was first published on 0xdata Blog, and kindly contributed to R-bloggers)

From my perspective the most important event that happened at useR! 2014 was that I got to meet the 0xdata team and now, long story short, here I am introducing the latest version of H2O, labeled Lagrange (2.6.0.11), to the R and greater data science communities. Before joining 0xdata, I was working at a competitor on a rival project and was repeatedly asked why my generalized linear model analytic didn’t run as fast as H2O’s GLM. The answer then as it is now is the same — because H2O has a cutting edge distributed in-memory parallel computing architecture — but I no longer receive an electric shock every time I say so.

For those hearing about H2O for the first time, it is an open-source distributed in-memory data analysis tool designed for extremely large data sets, and the H2O Lagrange (2.6.0.11) release provides scalable implementations of a range of analysis techniques.

In my first blog post at 0xdata, I wanted to keep it simple and make sure R users know how to get the h2o package, which is cross-referenced on the High-Performance and Parallel Computing and Machine Learning and Statistical Learning CRAN Task Views, up and running on their computers. To do so, open an R console of your choice and type

# Download, install, and initialize the H2O package
install.packages("h2o", repos = c("http://h2o-release.s3.amazonaws.com/h2o/rel-lagrange/11/R", getOption("repos")))
library(h2o)
localH2O <- h2o.init()

# List and run some demos to see H2O at work
demo(package = "h2o")
demo(h2o.glm)
demo(h2o.deeplearning)

After you are done experimenting with the demos in R, you can open up a web browser to http://localhost:54321/ to give the H2O web interface a once over and then hop over to 0xdata’s YouTube channel for some in-depth talks.

Over the coming weeks we at 0xdata will continue to blog about how to use H2O through R and other interfaces. If there is a particular use case you would like to see addressed, join our h2ostream Google Groups conversation or e-mail us at support@0xdata.com. Until then, happy analyzing.


To leave a comment for the author, please follow the link and comment on his blog: 0xdata Blog.

Categories: Methodology Blogs

A new candidate for worst figure

Mon, 2014-09-01 01:03

(This article was first published on Hyndsight » R, and kindly contributed to R-bloggers)

Today I read a paper that had been submitted to the IJF which included the following figure

along with several similar plots. I haven’t seen anything this bad for a long time. In fact, I think I would find it very difficult to reproduce using R, or even Excel (which is particularly adept at bad graphics).

A few years ago I produced “Twenty rules for good graphics”. I think I need to add a couple of additional rules:

  • Represent time changes using lines.
  • Never use fill patterns such as cross-hatching.

(My original rule #20 said Avoid pie charts.)

It would have been relatively simple to show these data as six lines on a plot of GDP against time. That would have made it obvious that the European GDP was shrinking, the GDP of Asia/Oceania was increasing, while other regions of the world were fairly stable. At least I think that is what is happening, but it is very hard to tell from such graphical obfuscation.
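For what it’s worth, a plot along those lines takes only a few lines of ggplot2. The sketch below uses a made-up data frame (the regions and values are placeholders, not the paper’s data) just to show the shape of the solution:

library(ggplot2)

# Hypothetical data standing in for the GDP-by-region figures
gdp <- expand.grid(year = 2000:2010,
                   region = c("Europe", "Asia/Oceania", "North America",
                              "South America", "Africa", "Middle East"))
gdp$gdp <- runif(nrow(gdp), 10, 40)  # placeholder values only

# One line per region, GDP against time
ggplot(gdp, aes(x = year, y = gdp, colour = region)) +
  geom_line() +
  labs(x = "Year", y = "GDP", colour = "Region")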

To leave a comment for the author, please follow the link and comment on his blog: Hyndsight » R.

Categories: Methodology Blogs

MongoDB – State of the R

Sun, 2014-08-31 11:27

(This article was first published on joy of data » R, and kindly contributed to R-bloggers)

Naturally there are two reasons why you might need to access MongoDB from R:

  1. MongoDB is already used for whatever reason and you want to analyze the data stored therein
  2. You decide you want to store your data in MongoDB instead of using native R technology like data.table or data.frame

In-memory data storage like data.table is very fast, especially for numerical data, provided the data actually fits into your RAM – but even then MongoDB comes with a bag of goodies that make it a tempting choice for a number of use cases:

  • Flexible schema-less data structures
  • Spatial and textual indexing
  • Spatial queries
  • Persistence of data
  • Easy accessibility from other languages and systems

In case you would like to learn more about MongoDB, I have good news for you – MongoDB Inc. provides a number of very well-made online courses catering to various languages. You can find an overview here.

rmongodb versus RMongo

The good news is – there are two packages available for making R talk to MongoDB.

For a larger project at work I decided to go with rmongodb because, in contrast to RMongo, it does not require Java, and it seems to be more actively developed – not to mention that MongoDB Inc. itself apparently has a finger in the pie. And I can say I did not regret that choice. It’s a great package – and along these lines a big thank you to Markus Schmidberger for investing his time in its development. Having said that, there are a few quirks and a rather uncomfortable, not-yet-resolved issue one is going to face for non-trivial queries. But before I get to those subjects, let me first give you a quick introduction to how the package is used.

Storing Data with rmongodb

I assume that a MongoDB daemon is running on your localhost. By the way, installing MongoDB and using it from R is pretty much effortless on Ubuntu as well as Windows. First we are going to establish a connection and check whether we were successful.

> library(rmongodb)
>
> m <- mongo.create()
> ns <- "database.collection"
>
> mongo.is.connected(m)
[1] TRUE

Now let’s insert an object which we define using JSON notation:

json <- '{"a":1, "b":2, "c": {"d":3, "e":4}}'
bson <- mongo.bson.from.JSON(json)
mongo.insert(m, ns, bson)

Maybe this intermediate step is a bit surprising – after all MongoDB stores JSONs!? Well, it doesn’t – it works with BSONs.

BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type. [bsonspec.org]

 

The data structure closest to a JSON in R is a list – so naturally we can use that too for specifying a document to be inserted:

list <- list(a=2, b=3, c=list(d=4, e=5))
bson <- mongo.bson.from.list(list)
mongo.insert(m, ns, bson)

 Retrieving Data from MongoDB

Now let me show you how to retrieve those two documents and print them – for illustrative purposes I query for documents whose field "a" holds a value greater than or equal to 1:

> json <- '{"a":{"$gte":1}}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> while(mongo.cursor.next(cursor)) {
+   value <- mongo.cursor.value(cursor)
+   list <- mongo.bson.to.list(value)
+   str(list)
+ }
List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 1
  .. ..- attr(*, "mongo.oid")=<externalptr>
 $ a  : num 1
 $ b  : num 2
 $ c  : Named num [1:2] 3 4
  ..- attr(*, "names")= chr [1:2] "d" "e"
List of 4
 $ _id:Class 'mongo.oid'  atomic [1:1] 0
  .. ..- attr(*, "mongo.oid")=<externalptr>
 $ a  : num 2
 $ b  : num 3
 $ c  : Named num [1:2] 4 5
  ..- attr(*, "names")= chr [1:2] "d" "e"

The search query, too, is only superficially a JSON and hence has to be cast to BSON before applying it. The result is a cursor which has to be iterated and yields the resulting documents – as BSONs, of course. This little ritual of converting to and from BSON feels a bit clumsy at times, but one gets used to it eventually, and of course nothing keeps you from writing a more comfortable wrapper.
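For instance, a minimal sketch of such a wrapper could look like this (find_as_list is just an illustrative name, not part of rmongodb):

# Run a JSON query against a namespace and return the matching
# documents as a plain list of R lists.
find_as_list <- function(con, ns, json_query) {
  bson <- mongo.bson.from.JSON(json_query)
  cursor <- mongo.find(con, ns, bson)
  results <- list()
  while (mongo.cursor.next(cursor)) {
    value <- mongo.cursor.value(cursor)
    results[[length(results) + 1]] <- mongo.bson.to.list(value)
  }
  mongo.cursor.destroy(cursor)  # release the cursor when done
  results
}

docs <- find_as_list(m, ns, '{"a":{"$gte":1}}')
length(docs)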

Implicit Conversion of Sub-Document to Vector

For the purpose of illustrating my point I added a document to the collection from the MongoDB shell:

db.collection.insert({_id:1, a:{b:2, c:3, d:4, e:5}})

Now what you will see is that rmongodb implicitly casts the lowest sub-document to a named R vector:

> json <- '{"_id":1}'
> bson <- mongo.bson.from.JSON(json)
> cursor <- mongo.find(m, ns, bson)
> mongo.cursor.next(cursor)
[1] TRUE
> value <- mongo.cursor.value(cursor)
> list <- mongo.bson.to.list(value)
> print(list)
$`_id`
[1] 1

$a
b c d e
2 3 4 5

> str(list)
List of 2
 $ _id: num 1
 $ a  : Named num [1:4] 2 3 4 5
  ..- attr(*, "names")= chr [1:4] "b" "c" "d" "e"

# it is not possible to access the sub-document as a list()
> list$a$b
Error in list$a$b : $ operator is invalid for atomic vectors
> list$a["b"]
b
2

This is something you have to keep in mind. Personally I find this a tad uncomfortable, as a list() would work too and would allow for more homogeneous processing. But that is just my two cents.
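If you prefer a list after all, a simple workaround is to coerce the named vector back (a minimal sketch, continuing from the output above):

a_as_list <- as.list(list$a)  # turn the implicitly created named vector back into a list
a_as_list$b
# [1] 2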

 Formulating Queries Using Arrays is Problematic

At present it is not possible to easily formulate a BSON containing an array. This is primarily a problem when you want to query documents and the query expression needs an array. For example:

{"$or": [{"a":1}, {"a":3}]}

Let’s assume three documents in database.collection :

> db.collection.find()
{ "_id" : ObjectId("53fe48857f953e7c617eea04"), "a" : 1 }
{ "_id" : ObjectId("53fe48877f953e7c617eea05"), "a" : 2 }
{ "_id" : ObjectId("53fe488a7f953e7c617eea06"), "a" : 3 }

We would expect to receive the two obvious documents. But doing so in rmongodb currently yields an error:

> library(rmongodb)
> M <- mongo.create("localhost")
> mongo.is.connected(M)
[1] TRUE
>
> qry1 <- list(
+   "a" = 1
+ )
>
> qry2 <- list(
+   "$or" = list(
+     list("a" = 1),
+     list("a" = 3)
+   )
+ )
>
> qry1 <- mongo.bson.from.list(qry1)
> qry2 <- mongo.bson.from.list(qry2)
>
> mongo.count(M, "test.xxx", qry1)
[1] 1
> mongo.count(M, "test.xxx", qry2)
[1] -1
> mongo.get.last.err(M, "test")
  connectionId : 16   24
  err : 2             $or needs an array
  code : 16           2
  n : 16              0
  ok : 1              1.000000

 Building Up the BSON Buffer from Scratch

Internally rmongodb constructs a BSON for a JSON by constructing an initial BSON buffer and then adding to it the elements of the JSON-document. For the suggested query this would be done as follows:

buf <- mongo.bson.buffer.create()

# "$or":[ ...
mongo.bson.buffer.start.array(buf, "$or")

# dummy name "0" for object in array
# "0": { ...
mongo.bson.buffer.start.object(buf, "0")
# "a":1
mongo.bson.buffer.append.int(buf, "a", 1)
# ... }
mongo.bson.buffer.finish.object(buf)

mongo.bson.buffer.start.object(buf, "1")
mongo.bson.buffer.append.int(buf, "a", 3)
mongo.bson.buffer.finish.object(buf)

# ...]
mongo.bson.buffer.finish.object(buf)

bson <- mongo.bson.from.buffer(buf)

Obviously this approach is going to get quite complicated and error-prone for even mildly complex queries/documents. But then again this method of building up a BSON is also quite generic and straightforward. So, to simplify this task I composed a little package that recursively traverses the representing list and invokes those functions appropriately.
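The gist of that recursion, stripped of array and type handling, looks roughly like this (a sketch only, not the actual rmongodbHelper code):

# Walk a named list and append its elements to a BSON buffer,
# descending into sub-lists as embedded objects.
list_to_buffer <- function(buf, x) {
  for (name in names(x)) {
    el <- x[[name]]
    if (is.list(el)) {
      mongo.bson.buffer.start.object(buf, name)
      list_to_buffer(buf, el)
      mongo.bson.buffer.finish.object(buf)
    } else {
      mongo.bson.buffer.append(buf, name, el)
    }
  }
  invisible(buf)
}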

rmongodbHelper at Your Service

The following code will turn the or-query from a JSON into a BSON:

# install rmongodbHelper package from GitHub
# install.packages("devtools")
library(devtools)
devtools::install_github("joyofdata/rmongodbHelper")
library(rmongodbHelper)

json_qry <- '{
  "$or": [
    {"a":1},
    {"a":3}
  ]
}'

bson <- rmongodbhelper::json_to_bson(json_qry)
cur <- mongo.find(M, "dbx.collx", bson)
while(mongo.cursor.next(cur)) {
  print(mongo.cursor.value(cur))
}

And its result:

  _id : 7    53fa14315aed8483db4ae794
  a : 16     1

  _id : 7    53fa14315aed8483db4ae796
  a : 16     3

Why Oh Why?

The reason why I took the time to craft a solution for this issue is twofold:

  1. I don’t program C, so I cannot contribute to rmongodb by fixing the issue myself
  2. The issue seems to have been around for almost a year

I really hope my solution will become superfluous very soon because R deserves an awesome MongoDB connector and rmongodb is pretty damn awesome already.

Some Details about rmongodbHelper

I answered a couple of questions on stackoverflow.com which provide more code on how to use the package.

Keep a few rules in mind in case you would like to use it:

  • It is version 0.0.0 – so there will be bugs – you are welcome to contribute or complain
  • If you would like to feed it a JSON then keep in mind that all keys have to be placed within double quotes
  • '"x":3'  will be casted as double
  • '"x":__int(3)'  will be casted as integer
  • Internally arrays and sub-documents are implemented as list()s. They are differentiated by non-existence of names() for arrays and presence of nams for sub-documents. An empty array has to contain one list()-element .ARR with arbitrary value.

To give an example for the last point:

L <- list(
  obj = list(
    obj = list(),
    array = list(.ARR = 1)
  ),
  arr = list(
    list(a = 1, b = 1),
    list("c", "d")
  )
)

bson <- rmongodbhelper::list_to_bson(L)
mongo.insert(M, "database.collection", bson)

The resulting document will look as follows when pretty-printed in the MongoDB shell:

> use database
> db.collection.find().pretty()
{
  "_id" : ObjectId("540085c4d915dc2b16c1327a"),
  "obj" : {
    "obj" : { },
    "array" : [ ]
  },
  "arr" : [
    {
      "a" : 1,
      "b" : 1
    },
    [
      "c",
      "d"
    ]
  ]
}

(original article published on www.joyofdata.de)

To leave a comment for the author, please follow the link and comment on his blog: joy of data » R.

Categories: Methodology Blogs

Periodic matrix model for annual plant demography

Sun, 2014-08-31 05:59

(This article was first published on ЯтомизоnoR » R, and kindly contributed to R-bloggers)

Let’s take on the challenge of building a matrix population model of an annual organism and then calculating the population growth rate λ using R.

Consider the simple life cycle of an imaginary annual plant that germinates in spring, flowers in summer, produces seeds in autumn, and dies in winter. Though the parent plants are dead in winter, dormant seeds stay alive in the soil. Some of the seeds germinate in the subsequent spring; others remain dormant.

The first thing to do is to build a stage-classified life cycle graph. The second is to define transition matrices that describe the graph. Finally, solving the characteristic equation of the matrices gives the stable population growth rate, namely λ.

1. Life cycle graph

Seasonal stages of life of such an annual plant are shown below.

Season   Stages
Spring   seedlings, seeds
Summer   flowering plants, seeds
Autumn   seed production stages, seeds
Winter   seeds

The four-season life cycle graph is shown below (Fig. 1).

2. Matrix

The life cycle graph of Fig. 1 is represented by four transition matrices (P, Q, R and S) and population structure vectors. The matrix P represents a transition from spring to summer. Q is from summer to autumn; R is from autumn to winter; and S is from winter to spring.

I chose four important components to describe the life cycle graph.

seed  <- 0.9^4  # Seed surviving rate; annual
germ  <- 0.3    # Germination rate; spring
plant <- 0.05   # Plant surviving rate; from germination to mature
yield <- 100    # Seed production number; per matured plant

Additionally, the pollination rate of flowers (or maturing rate) was used to split the plant survival rate into its before- and after-flowering components.

mature <- 0.5  # pollination rate

Here I defined everything using 'function()', that is, functions without arguments.

For example, I wrote 'seedq <- function() seed^(1/4)', not 'seedq <- seed^(1/4)' nor 'seedq <- function(seed) seed^(1/4)'.

This was because I wanted them to be recalculated automatically when I changed the values of the 5 parameters chosen above. I also decided to make the parameters global for simplicity. This idea works for tiny scripts, but will not for larger ones. Readers who prefer stricter scripting principles, or who use functions such as 'apply', may change these scripts so they do not rely on global variables.

seedq <- function() seed^(1/4)  # assume constant seed surviving rate through all seasons
p11   <- function() plant / mature
q11   <- function() mature
r11   <- function() yield
s11   <- function() germ * seedq()
s21   <- function() (1 - germ) * seedq()
p22 <- q22 <- r12 <- function() seedq()

Then the transition matrices of the four seasons are as follows (Fig. 3).

P <- function() matrix(data=c(
  p11(), 0,
  0,     p22()
), nrow=2, ncol=2, byrow=T)

Q <- function() matrix(data=c(
  q11(), 0,
  0,     q22()
), nrow=2, ncol=2, byrow=T)

R <- function() matrix(data=c(
  r11(), r12()
), nrow=1, ncol=2, byrow=T)

S <- function() matrix(data=c(
  s11(),
  s21()
), nrow=2, ncol=1, byrow=T)

The population growth rate from year t to year t + 1 is represented by an annual transition matrix (Fig. 4).

For the annual plant above, there are four different annual cycles based on four start seasons (Fig. 5).

Annual transition matrices are calculated by products of these seasonal matrices (Fig. 6).

A.spring <- function() S() %*% R() %*% Q() %*% P()
A.summer <- function() P() %*% S() %*% R() %*% Q()
A.autumn <- function() Q() %*% P() %*% S() %*% R()
A.winter <- function() R() %*% Q() %*% P() %*% S()

The annual matrix starting at winter is 1×1, namely a scalar: 1.81.
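With the parameter values above, evaluating the winter matrix directly should give roughly this:

A.winter()
#         [,1]
# [1,] 1.80927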

3. Population growth rate

Because no parent plants survive winter, the winter population consists of seeds only. So this scalar is exactly the population growth rate λ (Fig. 7).

Every annual matrix A gives the same λ regardless of the start season, because the annual growth rate of the population must be consistent. Let’s check it.

For a 2×2 matrix A (spring, summer and autumn), the stable growth rate is a solution of the characteristic equation below (Fig. 8).

The λ, the stable growth rate, is the dominant eigenvalue. The w, the stable population structure, is the eigenvector corresponding to the λ.

lambda <- function(A) eigen(A)$values[1]  # output of eigenvalues is decreasingly sorted

lambda(A.spring())
lambda(A.summer())
lambda(A.autumn())
lambda(A.winter())

The results are λ = 1.81 for all four cases.
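For completeness, the stable stage structure w can be obtained the same way; a small sketch using the matrices defined above:

stable.structure <- function(A) {
  w <- Re(eigen(A)$vectors[, 1])  # eigenvector paired with the dominant eigenvalue
  w / sum(w)                      # rescale to proportions
}
stable.structure(A.spring())  # proportions of seedlings and seeds in spring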

Fortunately, R can handle more complex matrices than 2×2. This simple procedure will work for complex life cycles that have more than two stages or additional paths.

References
  1. Caswell, H., 2001. Matrix Population Models: Construction, Analysis, and Interpretation, Sinauer Associates.
    ISBN 9780878930968

 


To leave a comment for the author, please follow the link and comment on his blog: ЯтомизоnoR » R.

Categories: Methodology Blogs

Beta binomial revisited

Sun, 2014-08-31 02:42

(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)

I got some comments on my priors in last week's post, where better priors were proposed. Hence this week's post looks at them. After some minor adaptations, since I have nine beta parameters and beta needs to be fairly high, I can conclude that these priors provide significant improvements. Especially the model which has p(alpha,beta) proportional to (alpha + beta)^(-5/2) worked extremely well.
Priors

There were two proposals. A common step was 'parameterizing in terms of mean theta = alpha / (alpha + beta) and total count kappa = alpha + beta'. In my own data there are nine regions, hence nine sets of theta, kappa, alpha and beta. As hyperprior over the nine regions' incidence I chose a normal distribution for theta. The difference between the proposals is in kappa. I made no hyperpriors over the nine values of kappa.
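In plain R terms, the reparameterization and its inverse are simply (alpha and beta here are arbitrary example values, just to show the algebra):

alpha <- 10; beta <- 90
theta <- alpha / (alpha + beta)   # mean: 0.1
kappa <- alpha + beta             # total count: 100
kappa * theta                     # back to alpha
kappa * (1 - theta)               # back to beta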
Prior: Uniform on log scale

'Andrew (...) suggests taking the priors on alpha and beta to be uniform on the log scale.' In code this means log kappa is uniformly distributed.
Prior: p(alpha,beta) prop to (alpha + beta)^(-5/2)

Apparently in Bayesian Data Analysis 3 it is proposed to make p(alpha,beta) proportional to (alpha + beta)^(-5/2). This is handled via a Pareto distribution.
Sampling kappa priors

Via a small program I had a look at what this means in terms of samples. The uniform log scale seemed to be reasonable even for the range of my beta parameters. On the other hand, the p(alpha,beta) prop to (alpha + beta)^(-5/2) proposal kept kappa way too low for the data at hand. This seems informative for a different range of kappa, and hence beta, rather than uninformative. Hence I made a different variation.
modelk <- '
parameters {
    real kappa_log1;
    real kappa2;
    real <lower=10^6> kappa3;
}
transformed parameters {
    real kappa1;
    kappa1 <- exp(kappa_log1);
}
model {      
    kappa_log1 ~ uniform(0,18);
    kappa2 ~ pareto(0.1,1.5);
    kappa3 ~ pareto(10^6,1.5);
}
'
parametersk <- c('kappa1','kappa2','kappa3')
initsk <- function() {
    list(
        kappa_log1=runif(1,0,18)
        , kappa2 = runif(1,.2,10)
        , kappa3 = runif(1,2e6,1e7)
    )
}

library(rstan)  # stan() comes from the rstan package

kfits <- stan(model_code = modelk,
        pars = parametersk,
        init = initsk)
kfits
Inference for Stan model: modelk.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean   se_mean         sd       2.5%        25%        50%
kappa1 1994621.64 550416.10 7754182.44       1.57      82.85    4772.51
kappa2       0.24      0.02       0.27       0.10       0.12       0.16
kappa3 2602295.43 296845.16 4314974.09 1021790.77 1209898.43 1558773.41
lp__       -18.81      0.16       1.65     -23.11     -19.64     -18.41
              75%       97.5% n_eff Rhat
kappa1  160223.83 25951765.02   198 1.03
kappa2       0.25        0.89   123 1.03
kappa3 2490623.45 11991112.38   211 1.02
lp__       -17.59      -16.83   112 1.03

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:12:51 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Effect in the whole model

Prior: Uniform on log scale

This works reasonably well. With significantly fewer samples than last week there are reasonable estimates. The only thing worrying is that incidence[1] (it was named betamean[1] in the previous models, but that was actually a bit of a misnomer) is a bit too high.
 model1 <- '
    data {
       int<lower=0> n;
       int<lower=0> nregion;
       int count[n];
       int<lower=1,upper=nregion> Region[n];
       int Population[n];
       real scale;
        }
    parameters {
       vector <lower=0,upper=1> [nregion]  theta;
       vector [nregion]  kappa_log;
       vector <lower=0,upper=1> [n] p1;
       real <lower=0,upper=1> thetam;
       real <lower=0> thetasd;
        }
  transformed parameters {
       vector [nregion] kappa;
       vector [nregion] a;
       vector [nregion] b;
       kappa <- exp(kappa_log);
       for (i in 1:nregion) {
          a[i] <- kappa[i] * theta[i];
          b[i] <- kappa[i] * (1 - theta[i]);
       }
  }
  model {     
        kappa_log ~ uniform(12,18);
        for (i in 1:n) {
            p1[i] ~ beta(a[Region[i]],b[Region[i]]);
        }
        count ~ binomial(Population,p1);
        theta ~ normal(thetam,thetasd);
        thetam ~ uniform(0,1);
        thetasd ~ uniform(0,.1);      
        }
  generated quantities {
        vector [n] pp;
        vector [nregion] incidence;
        real meaninc;
        incidence <- scale*theta;
        pp <- p1 * scale;
        meaninc <- scale*thetam;
        }
'

inits1 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
         thetam = rnorm(1,1e-6,1e-7),
         thetasd = rnorm(1,1e-6,1e-7),
         kappa_log=runif(datain$nregion,15,18),
         p1=rnorm(datain$n,1e-7,1e-8))

parameters1 <- c('a','b','incidence','kappa','kappa_log','meaninc')

fits1 <- stan(model_code = model1,
    data = datain,
    pars=parameters1    ,
    init=inits1)
print(fits1,probs=NA)
Inference for Stan model: model1. 
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean    se_mean          sd    n_eff Rhat
a[1]                7.95       1.32       13.67 NA   107 1.04
a[2]                9.16       1.43       16.76 NA   138 1.02
a[3]               50.52       2.61       32.13 NA   151 1.03
a[4]               35.68       2.25       32.61 NA   210 1.01
a[5]                9.10       1.32       13.99 NA   113 1.04
a[6]               11.99       1.48       17.40 NA   139 1.01
a[7]               13.64       1.83       20.56 NA   126 1.01
a[8]               14.19       2.56       20.35 NA    63 1.05
a[9]               12.14       2.51       21.39 NA    73 1.05
b[1]          5956080.76 1213799.11 10903417.54 NA    81 1.04
b[2]          5293475.42  844235.36  9984631.29 NA   140 1.01
b[3]         28376424.43 1399820.31 17900739.23 NA   164 1.03
b[4]         19757543.83 1195288.60 17665183.92 NA   218 1.02
b[5]          5493903.33  897750.46  9136674.30 NA   104 1.05
b[6]          6449567.62  787646.39  9190268.32 NA   136 1.01
b[7]          8500594.37 1172609.55 12531293.24 NA   114 1.01
b[8]          8901106.72 1698740.11 12644831.14 NA    55 1.06
b[9]          7267575.15 1406258.66 12121859.82 NA    74 1.05
incidence[1]        0.10       0.00        0.02 NA    84 1.07
incidence[2]        0.11       0.00        0.02 NA   177 1.02
incidence[3]        0.11       0.00        0.01 NA   153 1.01
incidence[4]        0.11       0.00        0.01 NA   178 1.02
incidence[5]        0.11       0.00        0.02 NA   216 1.01
incidence[6]        0.12       0.00        0.02 NA   121 1.03
incidence[7]        0.10       0.00        0.02 NA   161 1.02
incidence[8]        0.10       0.00        0.02 NA    81 1.06
incidence[9]        0.10       0.00        0.02 NA   119 1.04
kappa[1]      5956088.71 1213800.44 10903430.55 NA    81 1.04
kappa[2]      5293484.58  844236.78  9984647.90 NA   140 1.01
kappa[3]     28376474.95 1399822.89 17900770.95 NA   164 1.03
kappa[4]     19757579.51 1195290.81 17665216.05 NA   218 1.02
kappa[5]      5493912.43  897751.76  9136688.06 NA   104 1.05
kappa[6]      6449579.61  787647.86  9190285.60 NA   136 1.01
kappa[7]      8500608.01 1172611.40 12531313.60 NA   114 1.01
kappa[8]      8901120.91 1698742.67 12644851.22 NA    55 1.06
kappa[9]      7267587.30 1406261.16 12121881.04 NA    74 1.05
kappa_log[1]       14.56       0.18        1.40 NA    60 1.08
kappa_log[2]       14.37       0.13        1.42 NA   122 1.01
kappa_log[3]       16.88       0.07        0.87 NA   167 1.03
kappa_log[4]       16.28       0.08        1.15 NA   210 1.01
kappa_log[5]       14.76       0.11        1.17 NA   119 1.03
kappa_log[6]       15.10       0.08        1.03 NA   153 1.01
kappa_log[7]       15.08       0.17        1.36 NA    61 1.03
kappa_log[8]       15.12       0.22        1.38 NA    41 1.08
kappa_log[9]       14.76       0.17        1.45 NA    70 1.04
meaninc             0.11       0.00        0.01 NA   107 1.02
lp__            -7659.18       1.14       11.63 NA   105 1.01

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:24:52 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Prior: p(alpha,beta) prop to (alpha + beta)^(-5/2)
This model did quite well. I wish all my models worked as well.
model2 <- '
data {
    int<lower=0> n;
    int<lower=0> nregion;
    int count[n];
    int<lower=1,upper=nregion> Region[n];
    int Population[n];
    real scale;
}
parameters {
    vector <lower=0,upper=1> [nregion]  theta;
    vector <lower=1e6> [nregion]  kappa; 
    vector <lower=0,upper=1> [n] p1;
    real <lower=0,upper=1> thetam;
    real <lower=0> thetasd;
}
transformed parameters {
    vector [nregion] a;
    vector [nregion] b;
    for (i in 1:nregion) {
        a[i] <- kappa[i] * theta[i];
        b[i] <- kappa[i] * (1 - theta[i]);
    }
   
}
model {      
    kappa ~ pareto(10^6,1.5);
    for (i in 1:n) {
        p1[i] ~ beta(a[Region[i]],b[Region[i]]);
    }
    count ~ binomial(Population,p1);
    theta ~ normal(thetam,thetasd);
    thetam ~ uniform(0,1);
    thetasd ~ uniform(0,.1);       
}
generated quantities {
    vector [n] pp;
    vector [nregion] incidence;
    real meaninc;
    incidence <- scale*theta;
    pp <- p1 * scale;
    meaninc <- scale*thetam;
}
'

inits2 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
        thetam = rnorm(1,1e-6,1e-7),
        thetasd = rnorm(1,1e-6,1e-7),
        kappa=runif(datain$nregion,1e6,1e7),
        p1=rnorm(datain$n,1e-7,1e-8))

parameters2 <- c('a','b','incidence','kappa','meaninc')

fits2 <- stan(model_code = model2,
    data = datain,
    pars=parameters2,
    init=inits2)

print(fits2,probs=NA)
Inference for Stan model: model2.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean   se_mean          sd    n_eff Rhat
a[1]                3.03      0.06        2.54 NA  1851    1
a[2]                3.36      0.05        2.74 NA  2746    1
a[3]               22.28      1.39       39.52 NA   807    1
a[4]                7.27      0.27       12.39 NA  2173    1
a[5]                3.61      0.08        3.50 NA  2025    1
a[6]                4.38      0.09        3.70 NA  1743    1
a[7]                3.58      0.13        4.14 NA   957    1
a[8]                3.54      0.11        3.60 NA  1125    1
a[9]                3.20      0.10        3.81 NA  1346    1
b[1]          2132903.53  47586.20  2085059.70 NA  1920    1
b[2]          1804395.74  29404.95  1540741.77 NA  2745    1
b[3]         12482567.37 782922.81 22226255.17 NA   806    1
b[4]          4057579.18 146369.50  6818624.58 NA  2170    1
b[5]          2080275.79  44577.27  2020891.35 NA  2055    1
b[6]          2368569.05  48762.30  1922950.01 NA  1555    1
b[7]          2281906.54  81326.60  2583506.11 NA  1009    1
b[8]          2319737.52  71109.80  2408437.85 NA  1147    1
b[9]          2092476.48  61180.23  2313166.69 NA  1430    1
incidence[1]        0.09      0.00        0.02 NA   746    1
incidence[2]        0.12      0.00        0.02 NA  1173    1
incidence[3]        0.11      0.00        0.02 NA  1209    1
incidence[4]        0.11      0.00        0.02 NA  1387    1
incidence[5]        0.11      0.00        0.02 NA  1406    1
incidence[6]        0.12      0.00        0.02 NA  1248    1
incidence[7]        0.10      0.00        0.02 NA   999    1
incidence[8]        0.10      0.00        0.02 NA  1063    1
incidence[9]        0.10      0.00        0.02 NA   935    1
kappa[1]      2132906.56  47586.25  2085062.04 NA  1920    1
kappa[2]      1804399.10  29405.01  1540744.41 NA  2745    1
kappa[3]     12482589.65 782924.19 22226294.41 NA   806    1
kappa[4]      4057586.45 146369.76  6818636.82 NA  2170    1
kappa[5]      2080279.40  44577.35  2020894.78 NA  2055    1
kappa[6]      2368573.43  48762.38  1922953.62 NA  1555    1
kappa[7]      2281910.12  81326.73  2583510.17 NA  1009    1
kappa[8]      2319741.06  71109.91  2408441.36 NA  1147    1
kappa[9]      2092479.69  61180.33  2313170.41 NA  1430    1
meaninc             0.11      0.00        0.01 NA   901    1
lp__            -7881.32      0.42        8.66 NA   428    1

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:34:49 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

To leave a comment for the author, please follow the link and comment on his blog: Wiekvoet.

Categories: Methodology Blogs

Forecasts After Marina’s Turbulence

Sat, 2014-08-30 15:45

(This article was first published on Daniel Marcelino » R, and kindly contributed to R-bloggers)

It’s more difficult than ever to tell who is going to continue on to the runoff after October 5th. After Marina’s campaign mate’s death, a big chunk of the electorate has been persuaded towards an alternative to PT and PSDB.

The values I used to start the chains, my initial beliefs, were that Dilma would receive somewhere around 40%, Aécio ~ 25%, Campos/Marina ~ 20%, Others ~ 6%, and Nulls ~ 8%. The MCMC was run for 1,000 iterations, with a learning period of 500.

The following charts were produced using the forecast model I’ve been working on, which includes national as well as state polls. Note that the model is naïve in that it doesn’t include any other source of information. That is, I’m attempting to forecast the election based only on the mood of the electorate as captured by many pollsters.

The dots crossed by a vertical line on the charts represent my prior expectations for the election outcome. The colored area represents a 95% confidence interval for the day-to-day changes in vote intentions. The “+” signs show the measurement evidence from the polls. The area between the last poll and the dots (roughly the month of September) is the uncertain area I’m trying to foresee.

Note: Although Marina should have been lower according to my expectations, she is actually going up. She can even go further given the bad news regarding the economy’s performance, for which the government doesn’t seem to have a coherent explanation. Some say the World Cup was the vice, others say the World Cup was a virtue, and still others insist that the culprit is the international crisis. That one from 2008.

[Charts: DILMA ROUSSEFF (PT) · MARINA SILVA (PSB) · AÉCIO NEVES (PSDB) · OTHERS · NULLS]

To leave a comment for the author, please follow the link and comment on his blog: Daniel Marcelino » R.

Categories: Methodology Blogs

Age-Length Keys in FSA

Sat, 2014-08-30 09:12

(This article was first published on fishR » R, and kindly contributed to R-bloggers)

I have updated the age-length key related functions in FSA and the age-length key vignette.  The details are below.   Let me know if you have questions or suggestions.

FSA Changes

All code that used ageKey(), ageKeyPrep(), and ageKeyPlot() should be updated!!  See below.

  • ageKey() — DEPRECATED (will be removed by v1.0.0).  See alkIndivAge().  I made this name change because as I added more age-length key functionality (see below) the very generic sounding ageKey() seemed too broad for the specificity of its purpose (apply the Isermann and Knight method).
  • alkIndivAge() — Added, was ageKey().  Added seed=, which can be set to a number to serve as a random seed and allows reproducible results.  Made changes to minimize spurious warnings.  Modified to better handle the case where an age-length key has a whole row of missing data (as would happen if as.fact=TRUE and drop.levels=FALSE in lencat()).
  • ageKeyPlot() — DEPRECATED (will be removed by v1.0.0).  See alkPlot().  Changed to be consistent with the other functions.
  • alkPlot() — Removed bubble.ylab= and, accordingly, modified ylab=.  Added the ability to remove the legend from area plots and added pal= so that the user can choose a palette for the different colors of areas, bars, and lines in the various plots.  Added some checks on the structure of the age-length key.
  • ageKeyPrep() — DEPRECATED (will be removed by v1.0.0).  See alkPrep().  Changed to be consistent with the other functions.
  • alkPrep() — Added, was ageKeyPrep().  Added some checks on the structure of the age-length key.
  • alkAgeDist() — Added this function to compute the proportion of fish in each age in the larger sample, with a SE, following the methods of Quinn and Deriso.
  • alkMeanVar() — Added this function to compute the mean of a variable at each age (e.g., mean length-at-age) for the larger sample following the methods of Bettoli and Miranda.
  • Further changes can be seen in the NEWS file.
Vignette Changes

Modified the description of ALK dramatically by developing the general method first and then proceeding to the Isermann and Knight method (the original vignette was completely focused on the Isermann and Knight method).  Added sections for computing the proportion of fish at each age and the mean of a variable at each age for other than the Isermann and Knight method (see alkAgeDist() and alkMeanVar() above).  Demonstrated one method for computing smooth ALKs.  Demonstrated, with a smooth ALK, how to statistically compare two or more ALKs.

The new vignette will be a chapter in a new book and is available here.


Filed under: Administration, Fisheries Science, R Tagged: Age, Age-Length Key, FSA, R

To leave a comment for the author, please follow the link and comment on his blog: fishR » R.

Categories: Methodology Blogs

pbapply()

Fri, 2014-08-29 23:23

(This article was first published on Analyst At Large » R, and kindly contributed to R-bloggers)

In my previous post, I wrote about displaying a Windows progress bar (useful when going through long for-loops).  In the comments, a couple of users recommended that I check out pbapply().  This function is great too (and works on all platforms)!  Basically, it allows you to perform all the apply() functions – apply(), lapply(), sapply(), etc. and adds in a progress bar.  Here is some code showing a very simple example:

###### Packages
library(pbapply)

###### Data
n <- 1000
mat <- matrix(rnorm(n), ncol=10, nrow=100)

###### apply() vs pbapply()
x <- pbapply(mat, 1, sum)
y <- apply(mat, 1, sum)

x - y
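The list variants behave the same way; for instance, a deliberately slow lapply() (slow_sq below is just an illustrative helper) gets a progress bar for free:

slow_sq <- function(i) { Sys.sleep(0.01); i^2 }  # artificially slow function
res <- pblapply(1:200, slow_sq)                  # same result as lapply(), plus a progress bar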


To leave a comment for the author, please follow the link and comment on his blog: Analyst At Large » R.

Categories: Methodology Blogs

BH release 1.54.0-4

Fri, 2014-08-29 22:02

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another small new release of our BH package providing Boost headers for use by R is now on CRAN. This one brings a one-file change: the file any.hpp comprising the Boost.Any library --- as requested by a fellow package maintainer needing it for a pending upload to CRAN.

No other changes were made.

Changes in version 1.54.0-4 (2014-08-29)
  • Added Boost Any requested by Greg Jeffries for his nabo package

Courtesy of CRANberries, there is also a diffstat report for the most recent release.

Comments and suggestions are welcome via the mailing list or issue tracker at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box.

Categories: Methodology Blogs

LED is my new Hello World – R Time

Fri, 2014-08-29 14:13

(This article was first published on Blag's bag of rants, and kindly contributed to R-bloggers)

Yep...as I have said many times before, the LED Number application has become my new Hello World...whenever I learn a new programming language I try to build this, because it involves several language constructs and makes you learn more by trying to figure out how to build it again...

Yesterday I blogged about how to build it using Haskell...and...I'm still learning Haskell...so I thought..."It's cool that I'm using this for my new programming languages...but what about the old ones?"

I suddenly realized that I had never written this using R...and that's really sad...I love R -:)

Anyway...here it goes -;)

# LED.R
line <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
index <- c(0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,9,9,9)
map <- c(" _ ","| |","|_|",   # 0
         "   ","  |","  |",   # 1
         " _ "," _|","|_ ",   # 2
         " _ "," _|"," _|",   # 3
         "   ","|_|","  |",   # 4
         " _ ","|_ "," _|",   # 5
         " _ ","|_ ","|_|",   # 6
         " _ ","  |","  |",   # 7
         " _ ","|_|","|_|",   # 8
         " _ ","|_|"," _|")   # 9

Lines <- cbind(line, index, map)
DF_Lines <- data.frame(Lines, stringsAsFactors=F)

line1 <- ""
line2 <- ""
line3 <- ""

p_number <- readline("Enter a number?: ")
l_number <- strsplit(p_number, "")
for(i in 1:nchar(p_number)){
  num_map <- DF_Lines[DF_Lines$index == as.numeric(l_number[[1]][i]), ]
  line1 <- paste(line1, num_map$map[1])
  line2 <- paste(line2, num_map$map[2])
  line3 <- paste(line3, num_map$map[3])
}
lines <- paste(line1, "\n", line2, "\n", line3, "\n")
cat(lines)

Check the picture -;)


Greetings,
Blag.
Development Culture.

To leave a comment for the author, please follow the link and comment on his blog: Blag's bag of rants.

Categories: Methodology Blogs