R bloggers

R news and tutorials contributed by 573 R bloggers

Best practices for logging computational systems in R and Python

Thu, 2016-07-07 16:02

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

As is usually the case, quant software is a bit different from run-of-the-mill software. The somewhat prosaic world of logging is one place where there are some differences. What's different about quant systems? First, they have multiple run modes. Particularly in finance, models often run in real time but also historically. Models may also run incrementally or in batch. These scenarios each have their own logging needs. In particular, real-time processes typically need to show information about every event, whereas batch processes might only log every nth event or sample. Likewise, incremental processes may require a lot of debugging information, whereas the performance of batch processes will suffer with too much logging.

A good approach that balances these different scenarios is to log messages to both the console and a file. For interactive use the console is most convenient since the feedback is immediate. For long batch processes or running as a system, having the information in a log file is ideal. Trying to search for a log event in the console is a good way to waste a lot of time, whereas with a file you can take advantage of standard UNIX tools like grep and sed. Another nice thing about two loggers is that you can assign different log levels to each one. The console can be verbose, while the file can be more terse to ease searching for information. The same approach works in both Python and R.

Python

Logging comes as "included batteries" in Python. The stock logging package is powerful and flexible. The drawback is that it can be onerous to set up. In large systems, it's not uncommon to see a dozen logging configurations interspersed in code, each with their own setup, message format, and log levels. It's much better to standardize the logging and apply it once at the top level of the application. I usually talk about software being like an onion or a cabbage, with multiple layers of abstraction. The outer layers are tough and designed to protect against the outside world. As you go towards the center, it is clean and pure. The consequence is that the core assumes correct data types and whatnot, while it's the responsibility of the outer layers to ensure proper data types and formats. This makes testing easier, which I discussed in a prior post. It also makes it easier to re-use functions, since the pure core is more readily shared, while the use-case-specific or environment-specific code is wrapped up in the outer layers.

Anyway, back to logging. Following this principle, logging configuration is application specific, so it should be in the outer layers. In Python, there is usually a main function that acts as the entry point of a script, such as

if __name__ == '__main__':
    # Call app entry point

This is a good place to initialize your logging. Then all modules within your package/system can assume logging is configured. If it isn't, Python will tell you that logging hasn't been initialized, so you know when this assumption is faulty.

Python logging demands initialization. The simplest approach is a one-liner such as logging.basicConfig(level=logging.INFO), which sets the root logger level to INFO and prints to the console. But you'll grow out of this in a few days, particularly because the default message format is not great. It is better to use a configuration file to manage the logging system. The default ini-style log configurations are somewhat arcane and opaque due to the way sections have to be referenced by other variables. The newer YAML syntax is more intuitive, and as a markup language YAML is becoming more commonplace, so it's a reasonable replacement.

Our goals for logging include:

  1. define a useful message format that is both informative and easy to read;
  2. a verbose logger that writes to the console;
  3. a less verbose logger that writes to a file.

The following configuration satisfies the above goals. Typically, this configuration can be placed in a conf directory that contains all configuration files.

version: 1
formatters:
  simpleFormater:
    format: '%(levelname)7s [%(asctime)s] %(name)s: %(message)s'
    datefmt: '%Y-%m-%d %H:%M:%S'
handlers:
  console:
    class: logging.StreamHandler
    formatter: simpleFormater
    stream: ext://sys.stdout
  file:
    level: INFO
    class: logging.FileHandler
    formatter: simpleFormater
    filename: example.log
loggers:
  clogger:
    handlers: [console]
  flogger:
    handlers: [file]
root:
  level: DEBUG
  handlers: [console, file]

Let’s call this config file example_log.yaml. Using the logging configuration is easy: open the file and tell the logging system to use it.

import logging.config, yaml

with open('example_log.yaml') as f:
    D = yaml.load(f)

logging.config.dictConfig(D)

Now in your modules, you create a logger and log messages as usual.

logger = logging.getLogger(__name__)
logger.info("This is an INFO log message")
logger.warning("This is a WARNING message")
logger.debug("This is a DEBUG message")

In this example, all three messages will be output to the console, but only two will be written to disk.

This approach works for most purposes. If you want to change the name of the log file, update the filename property of the file handler from example.log to your preferred log file name. For additional customizations, refer to the Python Logging Cookbook.

R

In R, logging is not part of the included batteries. To fill this void, I wrote the futile.logger package to provide an easy-to-use logging facility in R. This package has similar semantics to Python's logging. The primary difference is that a convenient initial configuration is provided by default. In fact, to get console output, no configuration is needed at all! For example, after installing from CRAN or GitHub, you can log immediately to the console with a threshold of INFO:

> flog.info("My first log message")
INFO [2016-07-07 15:44:44] My first log message

futile.logger also supports multiple loggers and can write to files. All interaction is done via functions, where a logger is specified. By default, the ROOT logger is used, so if you don’t need more than that, don’t worry about the logger name. For example, to change the log threshold to DEBUG, use flog.threshold.
> flog.threshold(DEBUG)
> flog.debug("My debug log message")
DEBUG [2016-07-07 15:52:21] My debug log message

This means during an interactive run, you can quickly turn on DEBUG level messages, while the normal code runs at INFO level. If you want to write to both console and file, specify a new appender using flog.appender. Many appenders come bundled with futile.logger. Here we’ll use the tee appender to write to console and file.

> flog.appender(appender.tee('example.log'))
NULL
> flog.info("Writing to file")
INFO [2016-07-07 15:54:16] Writing to file

Unlike Python logging, futile.logger does not support configuration files. Instead, a code-based configuration must be used. The same principle can be followed though, where the configuration should be at the top level of the code. For R packages, this can be in the .onLoad(libname, pkgname) function. More information is available in the source.
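
As a concrete illustration, here is a minimal sketch of what such a code-based configuration could look like inside a package's .onLoad(); the log file name is illustrative, and the logger is named after the package so the ROOT logger is left alone:

.onLoad <- function(libname, pkgname) {
  # package logger at INFO, writing to both console and file via the tee appender
  futile.logger::flog.threshold(futile.logger::INFO, name = pkgname)
  futile.logger::flog.appender(
    futile.logger::appender.tee("example.log"), name = pkgname)
}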

Other Considerations

While logging greatly helps debugging code and understanding what a system is doing, there are a few things to be wary of. First, make sure not to accidentally commit your log files into your repository. The best way to prevent this is to add *.log in your .gitignore file.

Conclusion

With a bit of forethought, logging can be beneficial to your system and development process. By using the recipes above, you can quickly take advantage of logging in your data science projects.

To leave a comment for the author, please follow the link and comment on their blog: R – Cartesian Faith.

R Competition on education in South Africa (July and August 2016)

Thu, 2016-07-07 15:03

(Guest post by Bartosz Sękiewicz)

We invite you to participate in our Kaggle-style R competition, an online team competition (1-3 people) which is based in Poland though we would welcome international teams. It will take place during July and August 2016. The organisers are Do-IT Solutions Ltd and eRka (Cracow R User Group), who have decided to join forces again after our successful marathon data analysis in Cracow in July 2015.

On the (eRka Website) you will find an interview with Dr Ian Smythe – the main initiator of the competition – to help understand the background and why he thinks it has the potential to impact upon the lives of millions.

Due to the fact that it is a holiday period and the availability of some people may be limited, we decided to split the competition into three stages (awarded separately). For each stage, teams will be required to analyse previously unpublished data (most of these data have been collected this year) related to education in South Africa. A specially developed online Shiny app allows participants to familiarize themselves with the data quickly, so you can focus on creative exploration. Access to the app will be possible when the data is made available through GitHub.

The three competitions are:

  • Gender bias in cognitive tasks (mental rotation) – Open: 8 July, Deadline: 24 July
  • Scoring maths tests using errors and time – Open: 22 July, Deadline: 7 August
  • The socio-economic impact on literacy development – Open: 5 August, Deadline: 21 August

The only accepted tool to generate competition results is R. Other tools can be used, for example to clean up the data, but this should be clearly marked, and any modifications to the data set need to be submitted with the results. (N.B. Due to the method of data collection, the results are already very clean.)

The best teams will be awarded small prizes, to be presented during the ‘ceremony’, which will be held in late August in Krakow. It is likely that the results of the competition will be published in scientific journals.

If you are interested in cooperation outside this competition, either as a private individual or maybe as a lecturer, there are many more data sets, and many more worthy questions seeking creative answers. For further details, please contact Ian Smythe (ian[a]doitprofiler[dot]com).

More information and the contest rules (in English and Polish versions) can be found at www.doitprofiler.com/rkrakow


useR! 2016 Tutorials: Part 2

Thu, 2016-07-07 11:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

Last week, I mentioned a few of the useR tutorials that I had the opportunity to attend. Here are the links to the slides and code for all but two of the tutorials:

Regression Modeling Strategies and the rms Package – Frank Harrell
Using Git and GitHub with R, RStudio, and R Markdown – Jennifer Bryan
Effective Shiny Programming – Joe Cheng
Missing Value Imputation with R – Julie Josse
Extracting data from the web APIs and beyond – Ram, Grolemund & Chamberlain
Ninja Moves with data.table – Learn by Doing in a Cookbook Style Workshop – Matt Dowle
Never Tell Me the Odds! Machine Learning with Class Imbalances – Max Kuhn
MoRe than woRds, Text and Context: Language Analytics in Finance with R – Das & Mokashi
Time-to-Event Modeling as the Foundation of Multi-Channel Revenue Attribution – Tess Calvez
Handling and Analyzing Spatial, Spatiotemporal and Movement Data – Edzer Pebesma
Machine Learning Algorithmic Deep Dive – Erin LeDell
Introduction to SparkR – Venkataraman & Falaki
Using R with Jupyter Notebooks for Reproducible Research – de Vries & Harris
Understanding and Creating Interactive Graphics Part 1, Part 2 – Hocking & Ekstrom
Genome-Wide Association Analysis and Post-Analytic Interrogation Part 1, Part 2 – Foulkes
An Introduction to Bayesian Inference using R Interfaces to Stan – Ben Goodrich
Small Area Estimation with R – Virgilio Gómez Rubio
Dynamic Documents with R Markdown – Yihui Xie

Granted, since the tutorials were not videotaped they mostly fall into the category of a "you had to be there" experience. However, many of the presenters put significant effort into preparing their talks, and collectively they comprise a rich resource that is worth a good look. Here are just a couple of examples of what is to be found.

The first comes from Julie Josse's Missing Data tutorial where a version of the ozone data set with missing values is used to illustrate a basic principle of exploratory data analysis: visualize your data and look for missing values. If there are missing values try to determine if there are any patterns in their location.

         maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94

The first two plots, made with the aggr() function in the VIM package, show the proportion of missing values for each variable and the relationships of missingness among all of the variables.

The next plot shows a scatter plot of two variables, with boxplots along the margins that show the distributions of missing values for each variable. (Here blue represents data that are present and red the missing values.) The code to do this and many more advanced analyses is included on the tutorial page.

It looks like missing values are spread fairly evenly throughout the data.
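
If you want to try this style of missing-value overview yourself, here is a minimal sketch of the VIM calls involved; it is not the tutorial's code, and the sleep data set that ships with VIM stands in for the ozone data:

library(VIM)
data(sleep, package = "VIM")
# proportion of missing values per variable plus the missingness pattern
aggr(sleep, numbers = TRUE, sortVars = TRUE)
# scatter plot of two variables with marginal boxplots of the missing values
marginplot(sleep[, c("Span", "Sleep")])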

Frank Harrell's tutorial provides a modern look at regression analysis from a statistician's point of view. The following plot comes from the section of his tutorial on Modeling and Testing Complex Interactions. If you haven't paid much attention to the theory behind interpreting linear models in a while, you may find this interesting.

Finally, I had one of those "Aha" moments right at the beginning of Ben Goodrich's presentation on Bayesian modeling. MCMC methods work by simulating draws from a Markov chain whose limiting distribution converges to the distribution of interest. This technique works best when the simulated draws are able to explore the entire space of the target distribution. In the following figure, the target is the bivariate normal distribution on the far right. Neither the Metropolis nor Gibbs Sampling algorithms come close to sampling from the entire target distribution space, but the Hamiltonian Monte Carlo "NUTS" algorithm in the Stan package displays very good coverage.

For reasons I described last week I believe that this year's useR tutorial speakers have raised the bar on both content and presentation. I am going to do my best to work through these before attending next year's conference in Brussels. 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Bad ways to run a user group

Thu, 2016-07-07 08:53

(This article was first published on R – It's a Locke, and kindly contributed to R-bloggers)

I love user groups and I always want there to be more. I'm not a perfect organiser but I run reasonable groups. When I see organisers doing it badly, it makes me sad. There are lots of great ways to run a user group, but I thought I'd cover some of the bad ways to run a user group. The anti-patterns, if you will.

Don’t advertise

Your group isn’t on Twitter. Event notifications don’t get posted on local mailing lists / Slack groups. Your group basically runs on Fight Club rules*, and you wonder why you don’t get new people attending.

Turn it around: Getting started with social media

Advertise badly

You post notifications via arcane methods with low readership or high barriers to entry. You only post in the tiny LinkedIn group you set up. You don’t include vital information like the location of the event. In short, you waste your time and nobody sees your efforts.

Turn it around: Infographic on improving social media use

Make your own site

You bodge together a ’90s site. You never update it. You don’t include event pages. Your SEO is poor. You spend lots of time making and maintaining this thing or worse you spend no time making and maintaining it. Nobody finds your lone site.

Turn it around: Use Meetup

Promote uncertainty

You don’t keep an archive of past events. You don’t post an agenda. You don’t include important info like the language the talk is in. You don’t let people know what to expect when they show up.

Turn it around: Event description writing tips

Be awkward

You organise the group to make it convenient for you to attend. You throw it at your out-of-city-center venue. You throw it during the day, or on weekends. Oddly, nobody else shows up at that out-of-town campus during the working day.

Turn it around: Picking the date

Write negatively

The event description is mainly acronyms. There are references to getting drunk. You’re making jokes at someone’s expense. No tolerance of newbs is shown. You say you expect few people to bother turning up. You do your damnedest to discourage people who aren’t like you.

Turn it around: Event Organisers Considerations

Operate in a vacuum

You don’t talk to other local user groups. You don’t consider other events when you set a date and clash. You don’t network and gain contacts for potential speakers. You’re the only one who talks. You don’t ask for feedback from the people who show up.

Turn it around: Find new speakers

What other bad ways of running a user group have you seen? Can you recommend extra resources for people looking to do better? Comment below!

* You don’t talk about Fight Club

The post Bad ways to run a user group appeared first on It's a Locke.

To leave a comment for the author, please follow the link and comment on their blog: R – It's a Locke.

Latest on the Julia Language (vs. R)

Wed, 2016-07-06 22:47

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

I've written before about the Julia language. As someone who is very active in the R community, I am biased of course, and have been (and remain) a skeptic about Julia. But I would like to report on a wonderful talk I attended today at Stanford. To my surprise and delight, the speaker, Viral Shah of Julia Computing Inc, focused on the "computer science-y" details, i.e. the internals and the philosophy, which I found quite interesting and certainly very impressive.

I had not previously known, for instance, how integral the notion of typing is in Julia, e.g. integer vs. float, and the very extensive thought processes in the Julia group that led to this emphasis. And it was fun to see the various cool Julia features that appeal to a systems guy like me, e.g. an instant view of the assembly language implementation of a Julia function.

I was particularly interested in one crucial aspect that separates R from other languages that are popular in data science applications — NA values. I asked the speaker about that during the talk, only to find that he had anticipated this question and had devoted space in his slides to it. After covering that topic, he added that this had caused considerable debate within the Julia team as to how to handle it, which turned out to be something of a compromise.

Well, then, given this latest report on Julia (new releases coming soon), what is MY latest? How do I view it now?

As I’ve said here before, the fact that such an eminent researcher and R developer, Doug Bates of the University of Wisconsin, has shifted his efforts from R to Julia is enough for me to hold Julia in high regard, sight unseen. I had browsed through some Julia material in the past, and had seen enough to confirm that this is a language to be reckoned with. Today’s talk definitely raised my opinion of the language even further. But…

I am both a computer scientist and a statistician. Though only my early career was in a Department of Statistics (I was one of the founders of the UC Davis Stat. Dept.), I have done statistics throughout my career. And my hybrid status plays a key role in how I view Julia.

As a computer scientist, especially one who likes to view things at the systems levels, Julia is fabulous. But as a statistician, speed is only one of many crucial aspects of the software that I write and use. The role of NA values in R is indispensable, I say, not something to be compromised. And even more importantly, what I call the “helper” infrastructure of R is something I would be highly loathe to part with, things like naming of vector elements and matrix rows for instance. Such things have led to elegant solutions to many problems in software that I write.

And though undoubtedly (and hopefully) more top statisticians like Doug Bates will become active Julia contributors, the salient fact about R, as I always say, is that R is written for statisticians by statisticians. It matters. I believe that R will remain the language of choice in statistics for a long time to come.

And so, though my hat is off to Viral Shah, I don't think Julia is about to "go viral" in the stat world in the foreseeable future.

To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

Fatal Police Shootings Across the U.S.

Wed, 2016-07-06 20:00

(This article was first published on data science ish, and kindly contributed to R-bloggers)

I have been full of grief and sadness and some anger in the wake of yet more videos going viral in the past couple days showing black men being killed by police officers. I am not an expert on what it means to be a person of color in the United States or what is or isn’t wrong with policing today here, but it sure feels like something is deeply broken. I was reminded today that the Washington Post is compiling a database of every fatal shooting in the United States by a police officer in the line of duty since January 1, 2015 and has made that database publicly available on GitHub.

Their own visualizations and reporting are online here and are great, and today I decided to make a flexdashboard exploring the Washington Post’s data set, as it exists right now.

You can see and interact with the flexdashboard here.

As I note there in the sidebar, these numbers are presented without any adjustment for demographics here in the U.S. If you look at the bar graph showing which states have the most fatal police shootings, those tend to be the highest population states. I have not yet done any analysis looking at, for example, which states have a disproportionate number of fatal police shootings or anything like that. Also, it isn’t entirely clear yet if there are issues with underreporting that might bias these results. Is a shooting of a certain kind less likely to be reported in this data set? But for all those caveats, it is a start, and certainly we want to know where we are to move forward to a more just and peaceful world.

The code for the flexdashboard is here at this Gist. I am happy to hear feedback or questions on it!

To leave a comment for the author, please follow the link and comment on their blog: data science ish.

Fast and Big Linear Model Fitting with bigmemory and RcppEigen

Wed, 2016-07-06 20:00

(This article was first published on jared huling, and kindly contributed to R-bloggers)

In a previous post, I went over the basics of linking up bigmemory and the eigen C++ library via RcppEigen. In this post I’ll take this a bit further by creating a version of the fastLm() function of RcppEigen that can accept bigmemory objects. By doing so, we will create a fast way to fit linear models using data which is too big to fit in RAM. With RcppEigen, fitting linear models using out-of-memory computation doesn’t have to be slow. The code for this is all on github in the bigFastlm package.

Before we even start, most of the work is already done, as we'll just need to change a few lines of the core C++ code of the fastLm() function so that we can map the bigmemory pointer to data on disk to an eigen matrix object.

The core code of fastLm can be found here. The data object which is being loaded into C++ from R is mapped to an eigen matrix object at line 208 of fastLm.cpp.

const Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));

We need to change the above code to

XPtr<BigMatrix> bMPtr(Xs);

unsigned int typedata = bMPtr->matrix_type();

if (typedata != 8) {
    throw Rcpp::exception("type for provided big.matrix not available");
}

const Map<MatrixXd> X = Map<MatrixXd>((double *)bMPtr->matrix(), bMPtr->nrow(), bMPtr->ncol());

The above modification first takes Xs as an Rcpp external pointer object (XPtr) and then checks to make sure it's a double type (for now I'm ignoring all other data types (int, etc.) for simplicity). Now that X is a mapped eigen matrix object which points to data on disk, what else do we need to do? Well, not much! We just need to make sure that the correct object types are defined for the R-callable function. To do this, we need to change

// [[Rcpp::export]]
Rcpp::List fastLm_Impl(Rcpp::NumericMatrix X, Rcpp::NumericVector y, int type) {
    return lmsol::fastLm(X, y, type);
}

To

// [[Rcpp::export]]
RcppExport SEXP bigLm_Impl(SEXP X, SEXP y, SEXP type) {
BEGIN_RCPP
    Rcpp::RObject __result;
    Rcpp::RNGScope __rngScope;
    Rcpp::traits::input_parameter< Rcpp::XPtr<BigMatrix> >::type X_(X);
    Rcpp::traits::input_parameter< Rcpp::NumericVector >::type y_(y);
    Rcpp::traits::input_parameter< int >::type type_(type);
    __result = Rcpp::wrap(lmsol::fastLm(X_, y_, type_));
    return __result;
END_RCPP
}

in fastLm.cpp. (The // [[Rcpp::export]] attribute had some trouble doing this automatically, so the above is just what it should have created.)

So now with the proper R functions to call this, we’re basically done. I had to create a few utility functions to make everything work nicely, but the main work is just the above.

One important detail: for now, we can only use the LLT and LDLT methods for computation, as the other decompositions create objects which scale with the size of the data, so for now I’m ignoring the more robust decompositions like QR. Perhaps someone else can figure out how to perform these in a memory-conscious manner.

Comparison with biglm

Now we’ll run a (perhaps not-so-fair) comparison with the biglm function of biglm. Specifically, we’ll use the biglm.big.matrix function provided by the biganalytics which interfaces bigmemory and biglm. The following code creates a bigmemory object on disk (actually, two because biglm requires the matrix object to contain the response, whereas bigFastlm requires that the response be an R vector. It’s hard to say which is a better design choice, but I’m sticking with the approach which doesn’t allow an R formula expression).

suppressMessages(library(bigmemory))
suppressMessages(library(biganalytics))
suppressMessages(library(bigFastlm))

nrows <- 1000000
ncols <- 100

bkFile <- "big_matrix.bk"
descFile <- "big_matrix.desc"
big_mat <- filebacked.big.matrix(nrow = nrows, ncol = ncols, type = "double",
                                 backingfile = bkFile, backingpath = ".",
                                 descriptorfile = descFile,
                                 dimnames = c(NULL, NULL))

set.seed(123)
for (i in 1:ncols) big_mat[, i] = rnorm(nrows, mean = 1/sqrt(i)) * i

bkFile <- "big_matrix2.bk"
descFile <- "big_matrix2.desc"
big_mat2 <- filebacked.big.matrix(nrow = nrows, ncol = ncols + 1, type = "double",
                                  backingfile = bkFile, backingpath = ".",
                                  descriptorfile = descFile,
                                  dimnames = c(NULL, NULL))

for (i in 1:ncols) big_mat2[, i + 1] = big_mat[, i]

y <- rnorm(nrows)
big_mat2[, 1] <- y

options(bigmemory.allow.dimnames = TRUE)
colnames(big_mat2) <- c("y", paste0("V", 1:ncols))
options(bigmemory.allow.dimnames = FALSE)

## create formula for biglm
form <- as.formula(paste("y ~ -1 +", paste(paste0("V", 1:ncols), collapse = "+")))

Now let's see how biglm and bigLm (eigen + bigmemory) stack up. Note that this is an unfair comparison: biglm requires a formula argument, whereas bigLm assumes you're passing in a design matrix already, and biglm uses the QR decomposition, which is slower than LLT or LDLT.

library(microbenchmark)

res <- microbenchmark(biglm.obj <- biglm.big.matrix(form, data = big_mat2),
                      bigLm.obj <- bigLm(big_mat, y),
                      bigLm.obj2 <- bigLmPure(big_mat, y), ## a slightly faster version that doesn't check for intercept
                      times = 10L)

print(summary(res)[,1:7], digits = 4)

##                                                    expr    min     lq
## 1 biglm.obj <- biglm.big.matrix(form, data = big_mat2) 20.408 21.979
## 2                        bigLm.obj <- bigLm(big_mat, y)  1.720  1.769
## 3                   bigLm.obj2 <- bigLmPure(big_mat, y)  1.624  1.631
##     mean median     uq    max
## 1 22.987 22.788 23.981 25.995
## 2  1.825  1.793  1.867  2.055
## 3  1.673  1.669  1.693  1.741

max(abs(coef(biglm.obj) - coef(bigLm.obj)))
## [1] 5.551115e-17

bigLm seems to be quite a bit faster than biglm, but how fast is bigLm compared with fastLm (which requires the data to be loaded into memory)? It turns out it’s pretty close on my computer and I don’t even have anything fancy like a solid state drive.

suppressMessages(library(RcppEigen))

mat.obj <- big_mat[,]

## both using the LLT decomposition
res <- microbenchmark(fastLm.obj.llt <- fastLm(mat.obj, y, method = 2L),       # LLT Cholesky
                      bigLm.obj.llt <- bigLm(big_mat, y),                      # LLT Cholesky
                      fastLm.obj2 <- fastLmPure(mat.obj, y, method = 2L),
                      bigLm.obj2 <- bigLmPure(big_mat, y),                     ## a slightly faster version that doesn't check for intercept
                      fastLm.obj.ldlt <- fastLmPure(mat.obj, y, method = 3L),  # LDLT Cholesky
                      bigLm.obj.ldlt <- bigLmPure(big_mat, y, method = 1L),    # LDLT Cholesky
                      fastLm.obj.qrpiv <- fastLmPure(mat.obj, y, method = 0L), # column-pivoted QR
                      fastLm.obj.qr <- fastLmPure(mat.obj, y, method = 1L),    # unpivoted QR
                      times = 25L)

print(summary(res)[,1:7], digits = 4)

##                                                       expr    min     lq
## 1       fastLm.obj.llt <- fastLm(mat.obj, y, method = 2L)   4.517  4.732
## 2                       bigLm.obj.llt <- bigLm(big_mat, y)   1.726  1.784
## 3      fastLm.obj2 <- fastLmPure(mat.obj, y, method = 2L)   1.629  1.668
## 4                      bigLm.obj2 <- bigLmPure(big_mat, y)   1.611  1.667
## 5  fastLm.obj.ldlt <- fastLmPure(mat.obj, y, method = 3L)   1.598  1.658
## 6    bigLm.obj.ldlt <- bigLmPure(big_mat, y, method = 1L)   1.617  1.690
## 7 fastLm.obj.qrpiv <- fastLmPure(mat.obj, y, method = 0L)  12.072 13.119
## 8    fastLm.obj.qr <- fastLmPure(mat.obj, y, method = 1L)   9.386  9.772
##     mean median     uq    max
## 1  5.390  5.658  5.849  6.728
## 2  1.932  1.820  1.878  3.325
## 3  1.719  1.678  1.709  2.351
## 4  1.754  1.677  1.742  2.266
## 5  1.741  1.677  1.749  2.373
## 6  1.780  1.726  1.829  2.145
## 7 13.240 13.195 13.356 14.714
## 8 10.228 10.135 10.486 12.356

max(abs(coef(fastLm.obj.llt) - coef(bigLm.obj.llt)))
## [1] 0

Future work would be to try to figure out how to make the QR decomposition memory-feasible and also to write a function for generalized linear models.

To leave a comment for the author, please follow the link and comment on their blog: jared huling.

How many calories should you eat per day?

Wed, 2016-07-06 19:02

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

US GOVERNMENT GUIDELINES BY AGE, SEX, ACTIVITY LEVEL



At Decision Science News, we are always on the lookout for rules of thumb.

Our colleague Justin Rao was thinking it would be useful to express calories as a percentage of daily calories. So instead of a coke being 150 calories, you could think of it as 7.5% of your daily calories. Or whatever. The whatever is key.

This is an example of putting unfamiliar numbers in perspective.

So, we were then interested to see if there would be an easy rule of thumb for people to calculate how many calories per day they should be eating, so that they could re-express foods as a percentage of that.

We found some calorie guidelines on the Web published by the US government. With the help of Jake Hofman, we used Hadley Wickham‘s rvest package to scrape them and his other tools to process and plot them.
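
For readers curious what such a scrape looks like, here is a hedged sketch using rvest; the URL and the table index are illustrative guesses, not the code the authors used:

library(rvest)

# hypothetical location of the government guidelines table
url <- "https://health.gov/dietaryguidelines/2015/guidelines/appendix-2/"
page <- read_html(url)
# grab the first HTML table on the page and parse it into a data frame
calorie_tbl <- html_table(html_nodes(page, "table")[[1]], fill = TRUE)
head(calorie_tbl)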

The result is above. If you have any ideas on how to fit it elegantly, let us know.

We tried a number of fits. Lines are good for heuristics, so we made a bi-linear fit to the raw data (in points). We’re all grownups reading this blog, so let’s focus on the lines to the right of the peak.


Time to make the heuristics. For women, you need about 65 fewer calories per day for every decade after age 20. For men, you need about 105 fewer calories per day for every decade after age 20. Or let’s just say 70 and 100 to keep it simple.

So, if you have an opposite-sex life partner (OSLP?), keep in mind that you may need to cut back by more or fewer calories than the person across the table as you age together. Same-sex life partner (SSLP?), cut back the same amount. Just don't go beyond the range of the chart. The guidelines suggest even sedentary men shouldn't eat fewer than 2,000 calories a day at any age. For women, that number is 1,650.
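
If you prefer the rule of thumb as code, here is a toy function that encodes it; the 100/70 calories-per-decade slopes and the 2,000/1,650 floors are the post's round numbers, and the calorie needs at age 20 (at20) still have to come from the chart:

calories_at_age <- function(age, at20, sex = c("male", "female")) {
  sex        <- match.arg(sex)
  slope      <- if (sex == "male") 100 else 70    # fewer calories per decade after age 20
  floor_kcal <- if (sex == "male") 2000 else 1650 # sedentary lower bound from the guidelines
  max(at20 - slope * (age - 20) / 10, floor_kcal)
}

calories_at_age(50, at20 = 2400, sex = "male")
## [1] 2100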

REFERENCES

Barrio, Pablo J., Daniel G. Goldstein, & Jake M. Hofman. (2016). Improving comprehension of numbers in the news. ACM Conference on Human Factors in Computing Systems (CHI ’16). [Download]

The R code, below, has some other attempts at plots in it. You may be most interested in it as a way to see rvest in action. Or just to get the data.

The post How many calories should you eat per day? appeared first on Decision Science News.

To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

Playing Around with Methods Overloading, C-language and Operators (1)

Wed, 2016-07-06 14:10

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

This post was originally posted on Quantide blog. Read the full article here.

Introduction

R is an object-oriented (OO) language. This basically means that R is able to recognize the type of objects generated from an analysis and to apply the right operation to different objects.

For example, the summary(x) method performs different operations depending on the so-called “class” of x:

x <- data.frame(first=1:10, second=letters[1:10])
class(x) # An object of class "data.frame"

## [1] "data.frame"

summary(x)

##      first        second
##  Min.   : 1.00   a      :1
##  1st Qu.: 3.25   b      :1
##  Median : 5.50   c      :1
##  Mean   : 5.50   d      :1
##  3rd Qu.: 7.75   e      :1
##  Max.   :10.00   f      :1
##                  (Other):4

ds <- data.frame(x=1:100, y=100 + 3 * 1:100 + rnorm(100, sd = 10))
md <- lm(formula = y ~ x, data = ds)
class(md) # An object of class "lm"

## [1] "lm"

summary(md)

##
## Call:
## lm(formula = y ~ x, data = ds)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -20.1695  -5.8434  -0.4058   5.3611  27.9861
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.07892    1.99555   50.15   <2e-16 ***
## x             2.98890    0.03431   87.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.903 on 98 degrees of freedom
## Multiple R-squared:  0.9873, Adjusted R-squared:  0.9871
## F-statistic:  7590 on 1 and 98 DF,  p-value: < 2.2e-16

the outputs reported, and the calculations performed, by the summary() method are really different for the x and the md objects.

This behavior is one of characteristics of OO languages, and it is called methods overloading.

Method overloading can be applied not only to methods in "function form" (i.e., methods like summary()), but also to operators; indeed, "behind the scenes", operators are functions/methods. For example, if we type + in the R console we obtain:

`+`

## function (e1, e2) .Primitive("+")

That means that the + operator is actually a function/method that requires two arguments, e1 and e2, which are respectively the left and right arguments of the operator itself.
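
Because + is just a function, it can even be called in prefix form:

`+`(2, 3)

## [1] 5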

The + operator is present in R base, and can be overloaded as well, like in ggplot2 package, where the + operator is used to “build” the graph characteristics, as in following example:

require(ggplot2)
prds <- predict(object = md, interval = "prediction", level = .9)
ds <- cbind(ds, prds)
ds$outliers <- as.factor(ds$y < ds$lwr | ds$y > ds$upr)
graph <- ggplot(data = ds, mapping = aes(x = x, y = y, color = outliers))
graph <- graph + geom_point()
graph <- graph + geom_line(aes(y = fit), col = "blue")
graph <- graph + geom_line(aes(y = lwr), col = "green")
graph <- graph + geom_line(aes(y = upr), col = "green")
graph <- graph + ggtitle("Regression and 90% prediction bands")
print(graph)

 

The + operator, then, is applied differently with ggplot2 objects (with respect other object types), where it “concatenates” or “assembles” parts of final graph.

In this small post, and in the following ones, I would like to produce some "jokes" with objects, operators, overloading, and similar "oddities".

C-language += operator and its emulation in R

In the C language there are several useful operators that allow the programmer to save some typing and to produce more efficient and easier-to-read code. The first operator that I would like to discuss is the += one.

+= is an operator that performs operations like a = a + k.
In C, the above statement can be shortened to a += k. Of course, the statement can be something more complex, like

a += (x-log(2))^2

In this case, the code line shall be “translated” to a = a + (x-log(2))^2.

If I would like to have in R a new operator that acts similarly to C's +=, I would have to create it.

Unfortunately, not all names are allowed in R for new operators: if I want to create a new operator, I can only use names like %mynewoperator%, where the % symbols are mandatory.

Indeed, for this example, I will create a new %+=% operator that acts similarly to C's +=.

This new operator has to be able to get the values of variables passed as arguments, to sum them, and then, more importantly, to update the value of the first variable with the new value.
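
Here is a minimal sketch of how such an operator could be written; it is not the Quantide post's exact implementation, and it only handles a plain variable name on the left-hand side:

`%+=%` <- function(lhs, rhs) {
  target <- substitute(lhs)                      # capture the left-hand symbol
  value  <- eval(target, parent.frame()) + rhs   # compute the new value
  assign(deparse(target), value, envir = parent.frame())
  invisible(value)
}

a <- 10
a %+=% 5
a
## [1] 15

Note that %any% operators bind more tightly than binary + and -, so it is safest to wrap a compound right-hand side in parentheses, e.g. a %+=% ((x - log(2))^2).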

Continue reading on Quantide blog

The post Playing Around with Methods Overloading, C-language and Operators (1) appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

Data Shape Transformation With Reshape()

Wed, 2016-07-06 13:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

reshape() is an R function that accesses “observations” in grouped dataset columns and “records” in dataset rows, in order to programmatically transform the dataset shape into “long” or “wide” format.

Required dataframe:
data1 <- data.frame(id=c("ID.1", "ID.2", "ID.3"),
sample1=c(5.01, 79.40, 80.37),
sample2=c(5.12, 81.42, 83.12),
sample3=c(8.62, 81.29, 85.92))

Answers to the exercises are available here.

Exercise 1
Wide-to-Long:
Using the reshape() parameter “direction=“, “varying=” columns are stacked according to the new records created by the “idvar=” column.

Therefore, convert “data1” to long format, by stacking columns 2 through 4. The new row names are from column “id“. The new time variable is called, “TIME“. The column name of the stacked data is called “Sample“. Set a new dataframe variable called, “data2“.
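
For orientation, here is one possible way Exercise 1 could be solved (a sketch, not the official answer linked above):

data2 <- reshape(data1, direction = "long",
                 varying = 2:4, v.names = "Sample",
                 timevar = "TIME", idvar = "id")
data2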

Exercise 2
Long-to-Wide:
Use direction="wide" to convert “data2” back to the shape of “data1“. Setting a new variable isn’t needed. (Note that rownames from “data2” are retained.)

Exercise 3
Time Variables:
Script a reshape() operation, where “timevar=” is set to the variable within “data2” that differentiates multiple records.

Exercise 4
New Row Names:
Script a reshape() operation, where “data2” is converted to “wide” format, and “new.row.names=” is set to unique “data2$id” names.

Exercise 5
Convert “data2” to wide format. Set “v.names=” to the “data2” column with observations.

Exercise 6
Set sep = "" in order to reshape “data1” to long format.

Exercise 7
Reshape “data2” to “wide“. Use the “direction =” parameter. Setting a new dataframe variable isn’t required.

Exercise 8
Use the most basic reshape command possible, in order to reshape
“data2” to wide format.

Exercise 9
Reshape “data2” to “wide“, with column names for the reshaped data of “TIME” and “Sample“.

Exercise 10
Reshape “data1” by varying “sample1“, “sample2“, and “sample3“.

Image by Andreas Bauer (Own work) [CC-BY-SA-2.5], via Wikimedia Commons.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

The history of R’s predecessor, S, from co-creator Rick Becker

Wed, 2016-07-06 11:40

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Before there was R, there was S. R was modeled on a language developed at AT&T Bell Labs starting in 1976 by Rick Becker and John Chambers (and, later, Alan Wilks) along with Doug Dunn, Jean McRae, and Judy Schilling.

At last week's useR! conference, Rick Becker gave a fascinating keynote address, Forty Years of S. His talk recounts the history of S's genesis and its development using just 3MB of disk on an Interdata 8/32. Rick's talk includes numerous tidbits that explain many characteristics of R, including the philosophy behind the graphics system and the origin of the arrow <- assignment operator in R. The story is also coloured with anecdotes from various other luminaries at Bell Labs at the time, including John Tukey (the pioneer of exploratory data analysis and the inventor of the words "software" and "bit"), and Kernighan and Ritchie (who were upstairs designing Unix and the C language at the same time S was being developed).

Here's Rick's talk, with an introduction by Trevor Hastie. (Many thanks to Microsoft for recording and making the video available.)

 

For more on the history of S, see this interview with another of the creators of S, John Chambers.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Build your own offshore company

Wed, 2016-07-06 11:25

(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

Hackathons are not alike

Recently, a number of this blog’s authors were at a data hackathon, the strangest one we’ve been to so far. It was more of a startup pitch gathering, complete with pitch training and whatnot. I was repeatedly asked by other participants “so, how do you want to monetise your idea?”. My answer was simple: I don’t. I already have a job.

The topic of the hackathon was sufficiently vague (“ideas on economy, media, news, content, social media, digitilisation, knowledge management, content creation and distribution, and media transformation”), so we decided we wanted to take a look at the recently published Panama Papers data.

Since the entire setup for the project is in R – the data preparation, the analysis, the dashboard, and even the presentation – I think it might be of interest to other people in the R community, hence this blogpost. For the impatient, here's the direct link to the shiny app, and you can ignore my ramblings: https://safferli.shinyapps.io/hackathon_shiny/. I will only post the most interesting code snippets; the full code (for all preparatory data hacking, the shiny dashboard, and the presentation) is available on github, as always.

Panama Papers

For those that have been living under a rock in April, the Panama Papers are a leak of data from Mossack Fonseca, a law firm in Panama specialising in setting up offshore companies. It is so far the largest leak (source: The Economist) in history with 2.6TB of (original) data, and 11.5 million documents. It contains references to 29 billionaires (as ranked in Forbes), and 12 current or former country leaders.

The data on the network of firms and persons has recently been made available, and contains roughly 320 thousand companies. The hard work of cleaning the raw data and putting it into a useful format (.csv and a graph database format (neo4j) are provided) was done by newspaper companies and journalists world-wide.

Random Name Generator

My initial impulse to take a closer look at the data came when a friend of mine and I realised that a lot of the company names in the Panama Papers are actually kind of funny – and that it is extremely clear that nothing reputable can be expected from these companies. For instance, there is a "Moonlight Import/Export", and a "You'll See Ltd.".

So, we agreed that we’d build a random name generator for offshore companies, using the most popular company names from the Panama papers. To make things more “web-two-zero-y”, we would build one of these “modern Facebook viral thingies”, where you can generate the name from your birth date – e.g. your day of birth will always point to a fixed name part. We built a nice infographic for this, which is also available on github to download.

How do we get to the names? Quite easily. We first grab the names from provided data, use the excellent tidyr::separate to split the names into their parts (e.g. “My Company” becoming “My” and “Company”), remove a couple of common stopwords and then pick the 31 most prevalent company name parts.

# read company data provided by the Panama Papers
Entities <- read_csv(paste0(csv_folder, "Entities.csv"))

Entities %<>%
  # get lower-case names
  mutate(n = name %>% tolower) %>%
  # get last "word" of name for corporation form (e.g. "ltd.")
  mutate(form = gsub(".* ", "", n))

# split the names into one column per "word"
Entities %<>% separate(n, paste0("n", 1:20), fill = "right")

# stopwords -- we don't want these in our companies
stop <- c(letters, "the", "com", "and", "of", "int", "pty", "samoa", "sdn", "europe", "ptc", "")

# first "word" of the company name
n <- "n1"
filter_criteria <- lazyeval::interp(~ ! col %in% stop & ! col %in% Entities$form, col = as.name(n))

first <- Entities %>%
  group_by_(.dots = n) %>%
  summarize(n = n()) %>%
  filter_(filter_criteria) %>%
  arrange(-n) %>%
  .[[n]] %>%
  head(31)

# birthdates: 31 days in a month, get top 31
paste(1:31, "=", first, "\n") %>% cat

This we do for the first three parts of the company name, and voilà, the company name generator is done!
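
Purely as an illustration of the birth-date mapping (assuming second and third were built the same way as first, with 31 name parts each; this is not the actual generator code), the lookup could work like this:

make_name <- function(dob) {
  d <- as.POSIXlt(dob)
  # day of birth -> first part, month -> second part, year -> third part
  paste(first[d$mday], second[d$mon + 1], third[(d$year %% 31) + 1])
}
make_name(as.Date("1985-06-17"))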

Shiny dashboard

The next logical step was to build an app for this.

  1. randomly generate a name, until you’re satisfied with the result
  2. pick an intermediary – the person who will help you set up the offshore company
    • we’ll let you pick a country you want to start with, and then show the top five intermediaries, ranked by who opened up the most offshore companies
    • for convenience, we add a google maps link to the intermediary’s address, so you can fire up your GPS easily
    • check out intermediaries_from_country.R on github
  3. pick a jurisdiction for the offshore company by showing pictures of the available beaches
    • beach pictures were taken from the first result of a google image search of "$countryname+beach", with some fine tuning by hand (the US beach had girls in bikinis, which we did not think were appropriate for such a serious business proposal).
    • check out jurisdiction_images.R on github

All output is parsed into a fluidRow() UI element, step by step adding a new “row” to the dashboard.
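
A minimal sketch of that pattern (not the actual app code) looks roughly like this, with placeholder outputs standing in for the real name generator and intermediary table:

library(shiny)

ui <- fluidPage(uiOutput("steps"))

server <- function(input, output, session) {
  output$steps <- renderUI({
    tagList(
      fluidRow(column(12, h3("1. Company name"), textOutput("company_name"))),
      fluidRow(column(12, h3("2. Intermediary"), tableOutput("intermediaries")))
    )
  })
  output$company_name   <- renderText("Moonlight Import/Export Ltd.")     # placeholder
  output$intermediaries <- renderTable(data.frame(name = "placeholder"))  # placeholder
}

shinyApp(ui, server)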

The Shiny app is available on shinyapps.io, on the free plan, so it will be a bit slow: https://safferli.shinyapps.io/hackathon_shiny/. Head on over and build your own offshore company!

Conclusion

Ironically, we won the “Best Pitch” award for our project, even though we didn’t really pitch anything, and I also did not participate in the pitch training. If you speak German and are not put off by a shaky handcam recording, you can check out my presentation here. My talk starts at 1hr15min, roughly.

Writing this article, I just found out that there is a Panama Papers R package on github. So there you go, you can also get the data the easy way!

Code and data for this analysis is available on github, as always.

Build your own offshore company was originally published by Kirill Pomogajko at Opiate for the masses on July 06, 2016.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.

Geographic data to service the needs of a remote employee – part2

Wed, 2016-07-06 05:02

(This article was first published on Mango Solutions » R Blog, and kindly contributed to R-bloggers)

Ava Yang, Mango Solutions

Recap

In part 1 of this post I set out to find a flat to rent based on three simple criteria:

  • Café density
  • Tube station density
  • Monthly rent

So far I have made use of the baidumap and REmap packages to create a nice visualisation of available flats and coffee shops in Shanghai.

Calculation and scoring

Now let's do some basic math and programming. Three measures were derived from the original variables to quantify my preferences.

For café and tube station density, the closer the better and the more the better. Geographic distances were calculated with the distm function from the geosphere package.
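
As a quick illustration, distm() returns a matrix of pairwise distances in metres; the longitude/latitude pair below is made up:

library(geosphere)
distm(c(121.47, 31.23), c(121.50, 31.24))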

library(dplyr)
library(geosphere)
library(knitr)
library(baidumap)

load('data/ziroom.rds') # raw data
load("data/sh_cafe.rds")
load("data/sh_station.rds")

# 1. Generate names to represent flats
# 2. Extract longitude and latitude
sh_ziroom <- ziroom %>%
  mutate(name = paste("Room", rownames(ziroom), sep = "_")) %>%
  mutate(lon = getCoordinate(flat, city = "上海", formatted = T)[, 'longtitude']) %>%
  mutate(lat = getCoordinate(flat, city = "上海", formatted = T)[, 'latitude']) %>%
  na.omit() %>%
  select(c(lon, lat, name, price_promotion, flat))

# distance matrices: between cafe and flat, between station and flat
dist_cafe_flat <- distm(sh_cafe[, c("lon", "lat")], sh_ziroom[, c("lon", "lat")]) %>%
  as.data.frame()
dist_station_flat <- distm(sh_station[, c("lon", "lat")], sh_ziroom[, c("lon", "lat")]) %>%
  as.data.frame()

As an upper limit, I'm willing to walk as far as 750 metres (about 0.5 miles) to a café. Thus, cafeIdx and stationIdx were given by summing 1/log(distance) over all cafés (or stations) within 750 metres of a flat.

For this job I wrote a small custom function called calIdx.

# Function to calculate cafe_idx and station_idx
calIdx <- function(tmpcol) {
  tmpcol <- tmpcol[which(tmpcol < 750)]
  return(sum(1/log(tmpcol)))
}

Rent is a negative indicator, and so rentIdx could be obtained as 1/log(monthly rent).

The weighted score was then calculated as a weighted sum of the three indices, as shown in the code below.

# 1. cafeIdx    = 1/log(dis1) + 1/log(dis2) + ... + 1/log(disN)
# 2. stationIdx = 1/log(dis1) + 1/log(dis2) + ... + 1/log(disN)
# 3. rentIdx    = 1/log(price_promotion)
# 4. score      = 0.3*cafeIdx + 0.2*stationIdx + 0.5*rentIdx
sh_ziroom_top10 <- sh_ziroom %>%
  mutate(cafeIdx = sapply(dist_cafe_flat, calIdx)) %>%
  mutate(stationIdx = sapply(dist_station_flat, calIdx)) %>%
  filter(price_promotion <= 4000) %>%
  mutate(rentIdx = 1/log(as.numeric(price_promotion))) %>%
  mutate(score = 0.4*cafeIdx + 0.2*stationIdx + 0.4*rentIdx) %>%
  arrange(desc(score)) %>%
  slice(1:10)

Summary

kable(sh_ziroom_top10[, c("name", "score", "cafeIdx", "stationIdx", "rentIdx")], align = "c")

|  name   |   score   |  cafeIdx  | stationIdx |  rentIdx  |
|:-------:|:---------:|:---------:|:----------:|:---------:|
| Room_34 | 0.6480966 | 1.3262957 | 0.3380904  | 0.1249006 |
| Room_35 | 0.6470510 | 1.3262957 | 0.3380904  | 0.1222865 |
| Room_80 | 0.6054141 | 1.2216344 | 0.3378458  | 0.1229781 |
| Room_79 | 0.6048128 | 1.2216344 | 0.3378458  | 0.1214746 |
| Room_22 | 0.5729428 | 1.1430015 | 0.3349634  | 0.1218737 |
| Room_24 | 0.5729428 | 1.1430015 | 0.3349634  | 0.1218737 |
| Room_45 | 0.5292036 | 0.9617378 | 0.4709076  | 0.1258173 |
| Room_46 | 0.5284566 | 0.9617378 | 0.4709076  | 0.1239499 |
| Room_59 | 0.4334636 | 0.8012006 | 0.3237803  | 0.1205684 |
| Room_57 | 0.3836545 | 0.6721137 | 0.3302977  | 0.1218737 |

Done! See above for the top 10 room candidates. The mechanism I used is not difficult and makes my life so much easier. Moving to a new area which fulfils all my social needs is no longer such a big challenge!

 

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions » R Blog.

Resume & Interview Tips For R Programmers

Wed, 2016-07-06 04:04

(This article was first published on Articles – ProgrammingR, and kindly contributed to R-bloggers)

Speaking as a hiring manager, it doesn’t take much to stand out as a candidate for a statistical programming job. We just finished hiring the last of several analyst positions for a new data science unit at my day job. The final round was surprisingly less competitive that I expected; many of the candidates either failed to prepare or made basic mistakes in the job search process.

In the interests of helping others, here are few resume and interview tips that could have improved their chances.

1 – Google me (and my company) 

This one is basic, but I was shocked by the volume of candidates who didn’t even bother to learn about the company or the hiring manager.

This is surprisingly easy to fix and is a good first step in establishing yourself as a serious candidate.

First, the recruiter almost always tells you that you will be meeting with Person X from Company Y, who is in Industry Z. Fantastic, feed those three pieces of information into Google or LinkedIn and see what comes up. Usually, you get some great stuff like:

  • The hiring manager’s LinkedIn profile: full of information about their background and hints about the subjects that interest them. For example, I mention Python and the Python user group; telling me about your Python projects would probably earn you some extra credibility (just saying…).
  • Company Profile: At a minimum, read their LinkedIn or Crunchbase profile and take a peek at their website; read their most recent couple of press releases (these are listed in the “News” section of Google Finance).

If you’ve got the time,  look for information about how companies in that industry are using data science or technology. Google <industry name> and “data science” and “case study”. You’ll usually find a couple of articles about projects other people have done. Think about how we might apply them at my company; these are great interview topics (“I read Company XYZ is using text analysis to data mine customer comments, what do you guys think about this?”) for the dreaded “do you have any questions for me” portion of our conversation.

Humility is important here – the hiring manager knows a lot more about their industry and company than you do. But a little creative thinking can transform our interview from a painful conversation about “your greatest weakness” to a more exciting conversation about what you could do if you got the job. Guess which candidate I’m going to hire….

2 – Focus Job Descriptions On Unique Lessons / Accomplishments

Give the hiring managers a little credit. Most entry-level jobs are very similar across companies. Business analysts are generally asked to gather customer requirements. Project Managers hold meetings. Developers and Statistical Programmers write code and tests. In fact, many times we mentally boil this down to a candidate having X years of experience, sorted into buckets (analytics / technical / business).

Instead of reciting your everyday duties on your resume, highlight the top few things you accomplished in that position. Write a short summary of what you did and what the project accomplished or what you learned from it. Pay close attention to anything which is different or potentially interesting to a hiring manager. Which of your accomplishments would your manager brag about to their (non-technical) VP?

For example, I expect any statistical analyst has “extracted and transformed data” and “updated standard reporting”. What would catch my eye is someone describing how they analyzed a marketing program (identified the best segments to promote), figured out how to speed up DNA sequencing, or moved a bunch of standard reporting to a self-service website. These types of accomplishments set your resume apart.

Trust me, if you can tell me a cool story about how you made things better, I’ll assume you can probably handle the customer paperwork.

3 – Don’t List Technical Skills You Don’t Know (Very Well)

This particular topic has ascended to the coveted status of ‘pet peeve’. Every technical resume contains a section which lists the computer languages and packages that you would like me to believe you can use. That last point is crucial to the success of this section.

There is a misconception that having a massive list of technologies on your resume is a good thing. For most jobs and candidates, it isn’t. The reality is that my team has standardized around a couple of core technologies (R, Python, SQL) and anyone joining the group will be required to learn any of the missing technologies PLUS our environment PLUS our data. So there’s a tiny number of perfect unicorns roaming around out there who know everything on Day One, and a much larger number of decent candidates who know most of what’s needed and can easily bridge any gaps. Most sane hiring managers are aware of this (and are fine with it).

I get a warm fuzzy feeling that you can bridge the gaps when we talk about technologies you’ve mastered: significant projects that used the technology, whose details you can discuss. I don’t care if they were work projects or personal projects. In fact, since the professional projects I’m involved in (marketing and pricing data science) are covered by big confidentiality agreements, I often use examples from my side project (a word game site) for technical discussions. Nobody cares if I share how to build a Scrabble cheat. And many managers will give credit for expertise in a similar space. If you can master SAS, you can probably figure out R fairly quickly.

I do not, however, get that same warm fuzzy feeling when you indicate that the only exposure you had to a programming language that you list on your resume is an online course and you’ve never actually used it for a serious project. Especially if you can’t answer basic questions about the core concepts of the language. And please, if you don’t have significant recent practical experience in a technology, don’t dress it up with verbiage indicating you’re “proficient”. The interview is pretty much over once I discover a gap between your resume and reality.

There’s also a question of focus. The more stuff you list, the harder it is for your audience to understand what you’re actually good at. If you boil it down to a couple of highly relevant “preferred technologies”, a hiring manager will know exactly what you’re bringing to the table. You’re also communicating you’re serious about mastering that particular technology. Scrap the fluff skills and talk about your projects.

In summation, don’t put any technical skill on your resume without being prepared to demonstrate significant commitment to applying it.


4 – Don’t Oversell Your Online Classes

Sadly, that online class isn’t really a compelling signal you have technical skills.

First, taking an online course in data science or coding isn’t unique anymore. Most of my entry-level candidate pool claimed some form of online education or independent learning. Furthermore, they rarely provide an employer with an objective measure of technical aptitude. They do demonstrate that you’re interested in the craft, although I already expect that since you applied for the position.

Now – if you took that knowledge and applied it to create a useful project, that quickly flips the script. This can be anything – a website, a useful module or open-source contribution, a tutorial, or an interesting piece of data analysis posted on your blog. Our conversation will shift from a generic discussion of “the latest online course” to a more unique discussion of what you were able to accomplish with the tool. Plus you earn points for being a self-taught developer.


Relax and Have Fun

One common trait among all of our successful candidates was that they were able to show our team they were genuinely interested in the mission we were asking them to perform. They spoke enthusiastically about what they could accomplish if we gave them an opportunity. This combination of technical expertise and interest in the role was what got them hired.

Hopefully you’re reading this article because you like the craft of R programming and data science. So think about your next interview in that sense; not as some weird HR ritual to be endured, but as an opportunity to speak with the manager about how you can practice our craft. Bring your enthusiasm for R to the meeting and brainstorm with the manager about how you can use it to help them.

That will get you hired!

The post Resume & Interview Tips For R Programmers appeared first on ProgrammingR.

To leave a comment for the author, please follow the link and comment on their blog: Articles – ProgrammingR.

Categories: Methodology Blogs

7 new R jobs from around the world (2016-07-05)

Tue, 2016-07-05 14:43

Here are the new R Jobs for 2016-07-05.

To post your R job on the next post

Just visit this link and post a new R job to the R community. You can either post a job for free (which works great), or pay $50 to have your job featured (and get extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

New Featured Jobs
More New Jobs
  1. Full-Time
    Senior Financial Data Analyst for Amazon @ Seattle
    Amazon – Posted by Jacqui Hull
    Seattle
    Washington, United States
    1 Jul 2016
  2. Full-Time
    Sr. Data Scientist for Amazon @ Seattle
    Amazon – Posted by Jacqui Hull
    Seattle
    Washington, United States
    1 Jul 2016
  3. Full-Time
    bioinformatics analyst
    peter.shepard
    Anywhere
    1 Jul 2016
  4. Full-Time
    Data Scientist / Quantitative Analyst
    Sporting Data Limited – Posted by sportingdata
    London
    England, United Kingdom
    27 Jun 2016
  5. Full-Time
    Senior Data Scientist
    Global Strategy Group – Posted by datanorms
    New York
    New York, United States
    20 Jun 2016
  6. Part-Time
    problem solver
    IdeaConnection LTD – Posted by NKVanHerwaarden
    Anywhere
    17 Jun 2016
  7. Full-Time
    Data Analyst @ Los Angeles, California, United States
    VPS, LLC – Posted by gsotocampos
    Los Angeles
    California, United States
    16 Jun 2016

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 200 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.


(you may also look at previous R jobs posts).

Categories: Methodology Blogs

Creating inset maps using spatial objects

Tue, 2016-07-05 12:20

(This article was first published on R – jannesm, and kindly contributed to R-bloggers)

A while ago Arnold explained in his post how to create an inset map using ggplot2. This is great, but I have to admit I rarely use ggplot2 in combination with spatial data. Instead, I often find myself using the plot functions provided by the raster and sp packages. To create an inset map with these plot methods, we have to slightly adjust Arnold’s code.

First of all, let us attach some packages and data.

# attach packages library("sp") library("raster") library("grid") library("gridBase") library("TeachingDemos") library("rworldmap") library("RColorBrewer") library("classInt") # attach country polygons data(countriesLow) cous <- countriesLow # find the Netherlands net <- cous[which(cous@data$NAME == "Netherlands"), ] # load meuse.riv data(meuse.riv) # convert to SpatialPolygons riv <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)), ID = "1"))) # meuse dataset data(meuse) coordinates(meuse) <- c("x", "y") proj4string(meuse) <- CRS("+init=epsg:28992") # classifying cadmium into 5 classes q_5 <- classIntervals(meuse@data$cadmium, n = 5, style = "fisher") pal <- brewer.pal(5, "Reds") my_cols <- findColours(q_5, pal) # we also need lat/lon coordinates meuse_tr <- spTransform(meuse, proj4string(net))

Next, I create the main plot and subsequently add the inset map.

# create the figure
png(file = "meuse.png", w = 1800, h = 1800, res = 300)
plot.new()
vp_1 <- viewport(x = 0, y = 0, width = 0.91, height = 1, just = c("left", "bottom"))
vp_2 <- viewport(x = 0.61, y = 0.19, width = 0.22, height = 0.25, just = c("left", "bottom"))

# main plot
pushViewport(vp_1)
par(new = TRUE, fig = gridFIG())
plot(raster::crop(riv, bbox(meuse) + c(-500, -1000, 2000, 2000)),
     axes = TRUE, col = "lightblue",
     xlim = c(178500, 182000), ylim = c(329000, 334000))
plot(meuse, col = "black", bg = my_cols, pch = 22, add = TRUE)
legend("topleft", fill = attr(my_cols, "palette"),
       legend = names(attr(my_cols, "table")), bty = "n")
upViewport()

# inset map
pushViewport(vp_2)
par(new = TRUE, fig = gridFIG(), mar = rep(0, 4))
# plot the Netherlands and its neighbors
plot(cous[net, ], xlim = c(4.2, 5.8), ylim = c(50, 53.7), col = "white", bg = "transparent")
plot(net, col = "lightgray", add = TRUE)
# add the study area location
points(x = coordinates(meuse_tr)[1, 1], y = coordinates(meuse_tr)[1, 2], cex = 1.5, pch = 15)
shadowtext(x = coordinates(meuse_tr)[1, 1] - 0.35,
           y = coordinates(meuse_tr)[1, 2] - 0.1,
           labels = "study \n area", font = 3)
dev.off()


To leave a comment for the author, please follow the link and comment on their blog: R – jannesm.

Categories: Methodology Blogs

tibble 1.1

Tue, 2016-07-05 11:50

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re proud to announce version 1.1 of the tibble package. Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not. Grab the latest version with:

install.packages("tibble")

There are three major new features:

  • A more consistent naming scheme
  • Changes to how columns are extracted
  • Tweaks to the output

There are many other small improvements and bug fixes: please see the release notes for a complete list.

A better naming scheme

It has caused some confusion that you use data_frame() and as_data_frame() to create and coerce tibbles. It has also become more important to draw a clear distinction between tibbles and data frames as we evolve a little further away from data frame semantics.

Now, we’re consistently using “tibble” as the key word in creation, coercion, and testing functions:

tibble(x = 1:5, y = letters[1:5])
#> # A tibble: 5 x 2
#>       x     y
#>   <int> <chr>
#> 1     1     a
#> 2     2     b
#> 3     3     c
#> 4     4     d
#> 5     5     e

as_tibble(data.frame(x = runif(5)))
#> # A tibble: 5 x 1
#>           x
#>       <dbl>
#> 1 0.4603887
#> 2 0.4824339
#> 3 0.4546795
#> 4 0.5042028
#> 5 0.4558387

is_tibble(data.frame())
#> [1] FALSE

Previously tibble() was an alias for frame_data(). If you were using tibble() to create tibbles by rows, you’ll need to switch to frame_data(). This is a breaking change, but we believe that the new naming scheme will be less confusing in the long run.
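
For reference, a minimal sketch of row-wise construction with frame_data() looks like this (the column names and values are just placeholders):

frame_data(
  ~x, ~y,
   1, "a",
   2, "b"
)
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <chr>
#> 1     1     a
#> 2     2     b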

Extracting columns

The previous version of tibble was a little too strict when you attempted to retrieve a column that did not exist: we had forgotten that many people check for the presence of a column with is.null(df$x). This is a bad idea because of partial matching, but it is common:

df1 <- data.frame(xyz = 1)
df1$x
#> [1] 1

Now, instead of throwing an error, tibble will return NULL. If you use $, common in interactive scripts, tibble will generate a warning:

df2 <- tibble(xyz = 1)
df2$x
#> Warning: Unknown column 'x'
#> NULL

df2[["x"]]
#> NULL

We also provide a convenient helper for detecting the presence/absence of a column:

has_name(df1, "x") #> [1] FALSE has_name(df2, "x") #> [1] FALSE Output tweaks

We’ve tweaked the output to have a shorter header and more information in the footer. We’re using # consistently to denote metadata, and we print missing character values as <NA> (instead of NA).

The example below shows the new rendering of the flights table.

nycflights13::flights
#> # A tibble: 336,776 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1   2013     1     1      517            515         2      830
#> 2   2013     1     1      533            529         4      850
#> 3   2013     1     1      542            540         2      923
#> 4   2013     1     1      544            545        -1     1004
#> 5   2013     1     1      554            600        -6      812
#> 6   2013     1     1      554            558        -4      740
#> 7   2013     1     1      555            600        -5      913
#> 8   2013     1     1      557            600        -3      709
#> 9   2013     1     1      557            600        -3      838
#> 10  2013     1     1      558            600        -2      753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <time>

Thanks to Lionel Henry for contributing an option for determining the number of printed extra columns: getOption("tibble.max_extra_cols"). This is particularly important for the ultra-wide tables often released by statistical offices and other institutions.
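
For example, assuming you set the option in the usual way via options(), you could cap the number of extra columns named in the footer:

# Illustrative: list at most 5 of the remaining variables in the footer
options(tibble.max_extra_cols = 5)
nycflights13::flights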

Expect the printed output to continue to evolve. In the next version, we hope to do better with very wide columns (e.g. from long strings), and to make better use of now unused horizontal space (e.g. from long column names).

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Categories: Methodology Blogs

eRum 2016: First European conference for the programming language R

Tue, 2016-07-05 11:00

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

The European R community has a new place to go for networking and exchanging experiences. For the first time, the Poznan University of Economics and Business hosts the European R users meeting (eRum) in Poland from 12 to 14 October. 250 participants from business and science are expected at the event in Poznan.


eRum 2016: European R users meeting

Meeting of R heroes

The topics of eRum 2016 reflect the many methods and fields of application of the programming language. From Bayesian statistics to visualization, and from bioinformatics to finance and economics – the workshops and presentations cover a wide range of topics relevant to R users. With Rasmus Bååth, Romain Francois and Ulrike Grömping, some of the leading European R developers have already confirmed their appearance as speakers at the event. Consequently, eRum 2016 is almost fully booked with more than 200 registrations, and at its premiere it already looks set to become the European counterpart to the long-established useR! conference.

eoda supports the eRum 2016 as sponsor

The data science specialist eoda has relied on R for years and now supports eRum 2016 as a sponsor. “The positive development of R in recent years is based on the dedication of a strong community. We are happy that R users will have a new platform for exchanging knowledge, and we regard the eRum as an important component for the further distribution of R in Europe,” says eoda Chief Data Scientist Oliver Bracht, explaining the involvement of the Kassel-based company.

More information about the eRum 2016 and the registration can be found here: http://erum.ue.poznan.pl/.

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

Categories: Methodology Blogs

httr 1.2.0

Tue, 2016-07-05 10:45

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

httr 1.2.0 is now available on CRAN. The httr package makes it easy to talk to web APIs from R. Learn more in the quick start vignette. Install the latest version with:

install.packages("httr")

There are a few small new features:

  • New RETRY() function allows you to retry a request multiple times until it succeeds, which is useful if you are trying to talk to an unreliable service (see the sketch after this list). To avoid hammering the server, it uses exponential backoff with jitter, as described in https://www.awsarchitectureblog.com/2015/03/backoff.html.
  • DELETE() gains a body parameter.
  • encode = "raw" parameter to functions that accept bodies. This allows you to do your own encoding.
  • http_type() returns the content/mime type of a request, sans parameters.
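
As promised above, here is a minimal sketch of RETRY() in action; the URL is a hypothetical flaky endpoint and the parameter values are purely illustrative.

library(httr)

# Retry up to 5 times, backing off exponentially between attempts
resp <- RETRY("GET", "https://example.com/flaky-endpoint", times = 5)
stop_for_status(resp)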

There is one important bug fix:

  • No longer uses custom requests for standard POST requests. This has the side-effect of properly following redirects after POST, fixing some login issues in rvest.

httr 1.2.1 includes a fix for a small bug that I discovered shortly after releasing 1.2.0.

For the complete list of improvements, please see the release notes.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Categories: Methodology Blogs

xml2 1.0.0

Tue, 2016-07-05 10:41

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We are pleased to announce that xml2 1.0.0 is now available on CRAN. xml2 is a wrapper around the comprehensive libxml2 C library and makes it easy to work with XML and HTML files in R. Install the latest version with:

install.packages("xml2")

There are three major improvements in 1.0.0:

  1. You can now modify and create XML documents.
  2. xml_find_first() replaces xml_find_one(), and provides better semantics for missing nodes.
  3. Improved namespace handling when working with XPath.

There are many other small improvements and bug fixes: please see the release notes for a complete list.

Modification and creation

xml2 now supports modification and creation of XML nodes. This includes new functions xml_new_document(), xml_new_child(), xml_new_sibling(), xml_set_namespace(), xml_remove(), xml_replace(), xml_root(), and replacement methods for xml_name(), xml_attr(), xml_attrs() and xml_text().

The basic process of creating an XML document by hand looks something like this:

root <- xml_new_document() %>% xml_add_child("root")

root %>%
  xml_add_child("a1", x = "1", y = "2") %>%
  xml_add_child("b") %>%
  xml_add_child("c") %>%
  invisible()

root %>%
  xml_add_child("a2") %>%
  xml_add_sibling("a3") %>%
  invisible()

cat(as.character(root))
#> <?xml version="1.0"?>
#> <root><a1 x="1" y="2"><b><c/></b></a1><a2/><a3/></root>

For a complete description of creation and mutation, please see vignette("modification", package = "xml2").

xml_find_first()

xml_find_one() has been deprecated in favor of xml_find_first(). xml_find_first() now always returns a single node: if there are multiple matches, it returns the first (without a warning), and if there are no matches, it returns a new xml_missing object.

This makes it much easier to work with ragged/inconsistent hierarchies:

x1 <- read_xml("<a> <b></b> <b><c>See</c></b> <b><c>Sea</c><c /></b> </a>") c <- x1 %>% xml_find_all(".//b") %>% xml_find_first(".//c") c #> {xml_nodeset (3)} #> [1] <NA> #> [2] <c>See</c> #> [3] <c>Sea</c>

Missing nodes are replaced by missing values in functions that return vectors:

xml_name(c)
#> [1] NA  "c" "c"

xml_text(c)
#> [1] NA    "See" "Sea"

XPath and namespaces

XPath is challenging to use if your document contains any namespaces:

x <- read_xml('
  <root>
    <doc1 xmlns = "http://foo.com"><baz /></doc1>
    <doc2 xmlns = "http://bar.com"><baz /></doc2>
  </root>
')

x %>% xml_find_all(".//baz")
#> {xml_nodeset (0)}

To make life slightly easier, the default xml_ns() object is automatically passed to xml_find_*():

x %>% xml_ns()
#> d1 <-> http://foo.com
#> d2 <-> http://bar.com

x %>% xml_find_all(".//d1:baz")
#> {xml_nodeset (1)}
#> [1] <baz/>

If you just want to avoid the hassle of namespaces altogether, we have a new nuclear option: xml_ns_strip():

xml_ns_strip(x)
x %>% xml_find_all(".//baz")
#> {xml_nodeset (2)}
#> [1] <baz/>
#> [2] <baz/>

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Categories: Methodology Blogs