R bloggers

R news and tutorials contributed by (580) R bloggers

How to choose the right tool for your data science project

Thu, 2016-09-22 12:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Brandon Rohrer, Principal Data Scientist, Microsoft

R or Python? Torch or TensorFlow? (or MXNet or CNTK)? Spark or map-reduce?

When we're getting started on a project, the mountain of tools to choose from can be overwhelming. Sometimes it makes me feel small and bewildered, like Alice in Wonderland. Luckily, the Cheshire Cat cut to the heart of the problem:

“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where–” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“–so long as I get SOMEWHERE,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”
(Alice’s Adventures in Wonderland, Chapter 6)

The first step to choosing your tools is to choose a goal. Make it clear and keep it firmly in mind.

That’s most of the work. After that there are a few other things to consider and traps to watch out for, but you’re 90% of the way there. Some tools fit some tasks better than others, so it’s just a matter of finding a match. The rest of the details are in this blog post, but if you just let your goal drive your choices, you can’t go far wrong.

Best of luck on your next project!

Data Science and Robots Blog: Which tool should I use?

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

Introducing the R Data Science Livestream

Thu, 2016-09-22 11:37

(This article was first published on Data, Evidence, and Policy - Jared Knowles, and kindly contributed to R-bloggers)

Have you ever watched a livestream? Have you ever wondered what the actual minute to minute of doing data science looks like? Do you wonder if other R users have the same frustrations as you? If yes — then read on!

I’m off on a new professional adventure where I am doing public facing work for the first time in years. While working at home the other day I thought it would be a great idea to keep myself on-task and document my decisions if I recorded myself working with my webcam. Then, I thought, why stop there — why not livestream my work? 

And thus, the R Data Science Livestream was born. The idea is that every day for an hour or two I will livestream myself doing some data science tasks related to my current project — which is to analyze dozens of years of FBI Uniform Crime reports (read more). I haven’t done much R coding in the last 4 months, so it’s also a good way to shake off the rust of being out of the game for so long.

So if you are at all interested or curious why someone would do this, check out the landing page I put up to document the project and if you are really curious, maybe even tune in or watch the archives on YouTube!

 

To leave a comment for the author, please follow the link and comment on their blog: Data, Evidence, and Policy - Jared Knowles.

Categories: Methodology Blogs

R Markdown: How to number and reference tables

Thu, 2016-09-22 08:55

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

R Markdown is a great tool to make research results reproducible. However, in scientific research papers or reports, tables and figures usually need to be numbered and referenced. Unfortunately, R Markdown has no "native" method to number and reference table and figure captions. The recently published bookdown package makes it very easy to number and reference tables and figures (Link). However, since bookdown uses LaTeX functionality, R Markdown files created with bookdown cannot be converted into MS Word (.docx) files.

In this blog post, I will explain how to number and reference tables and figures in R Markdown files using the captioner package.

Packages required

The following code will load and/or install the R packages required for this blog post. The dataset I will be using is named bundesligR and is part of the bundesligR package. It contains "all final tables of Germany's highest football league, the Bundesliga" (Link).

if (!require("pacman")) install.packages("pacman") pacman::p_load(knitr, captioner, bundesligR, stringr)

In the first code snippet, we create a table using the kable function of the knitr package. With caption we can specify a simple table caption. As we can see, the caption will not be numbered and, thus, cannot be referenced in the document.
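The chunk itself is not reproduced in this extract. A minimal sketch of what it might look like, assuming the 2015/16 season occupies the first rows of bundesligR and reusing the column selection shown later in this post, is:

knitr::kable(bundesligR::bundesligR[1:6, c(2, 3, 11, 10)],
             align = c('c', 'l', 'c', 'c'), row.names = FALSE,
             caption = "German Bundesliga: Final Table 2015/16, Position 1-6")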

German Bundesliga: Final Table 2015/16, Position 1-6

Position | Team | Points | GD
1 | FC Bayern Muenchen | 88 | 63
2 | Borussia Dortmund | 78 | 48
3 | Bayer 04 Leverkusen | 60 | 16
4 | Borussia Moenchengladbach | 55 | 17
5 | FC Schalke 04 | 52 | 2
6 | 1. FSV Mainz 05 | 50 | 4

Table numbering

Thanks to Alathea Letaw’s captioner package, we can number tables and figures.
In a first step, we define a function named table_nums and apply it to the tables’ name and caption. Furthermore, we may also define a prefix (Tab. for tables and Fig. for figures).

table_nums <- captioner::captioner(prefix = "Tab.")
tab.1_cap <- table_nums(name = "tab_1", caption = "German Bundesliga: Final Table 2015/16, Position 7-12")
tab.2_cap <- table_nums(name = "tab_2", caption = "German Bundesliga: Final Table 2015/16, Position 12-18")

The next code snippet combines both inline code and a code chunk. With fig.cap = tab.1_cap, we specify the caption of the first table. It is important to separate inline code and code chunk. Otherwise the numbering won’t work.
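The .Rmd source for this step is not reproduced in this extract. One common way to combine the two, sketched here with assumed chunk options, is to place the inline caption directly above the chunk:

`r tab.1_cap`

```{r, echo = FALSE}
knitr::kable(bundesligR::bundesligR[7:12, c(2, 3, 11, 10)],
             align = c('c', 'l', 'c', 'c'), row.names = FALSE)
```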

Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12

Position | Team | Points | GD
7 | Hertha BSC | 50 | 0
8 | VfL Wolfsburg | 45 | -2
9 | 1. FC Koeln | 43 | -4
10 | Hamburger SV | 41 | -6
11 | FC Ingolstadt 04 | 40 | -9
12 | FC Augsburg | 38 | -10

Table referencing

Since we now have a numbered table, it should also be possible to reference it. However, we cannot simply use the inline code

table_nums('tab_1')

on its own, because that returns the full caption as output:

[1] “Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12”

In order to return the desired output (prefix Tab. and table number), I have written the function f.ref. Using a regular expression, the function returns all characters of the table_nums('tab_1') output located before the first colon.

f.ref <- function(x) { stringr::str_extract(table_nums(x), "[^:]*") }

When we apply this function to tab_1, the inline code returns the following result:

Inline code: As we can see in f.ref("tab_1"), the Berlin based football club Hertha BSC had position seven in the final table.

Result: As we can see in Tab. 1, the Berlin based football club Hertha BSC had position seven in the final table.

Just to make the table complete, Tab. 2 shows positions 13 to 18 of the final Bundesliga table.

knitr::kable(bundesligR::bundesligR[c(13:18), c(2, 3, 11, 10)],
             align = c('c', 'l', 'c', 'c'), row.names = FALSE)

Tab. 2: German Bundesliga: Final Table 2015/16, Position 12-18

Position | Team | Points | GD
13 | Werder Bremen | 38 | -15
14 | SV Darmstadt 98 | 38 | -15
15 | TSG 1899 Hoffenheim | 37 | -15
16 | Eintracht Frankfurt | 36 | -18
17 | VfB Stuttgart | 33 | -25
18 | Hannover 96 | 25 | -31

And what about figures?

Figures can be numbered and referenced following the same principle.

I hope you find this post useful. If you have any questions, please post a comment below. You are welcome to visit my personal blog Scripts and Statistics for more R tutorials.

    Related Post

    1. RDBL – manipulate data in-database with R code only
    2. R Markdown: How to format tables and figures in .docx files
    3. R Markdown: How to insert page breaks in a MS Word document
    4. Working on Data-Warehouse (SQL) with R
    5. Implementing Apriori Algorithm in R

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

    Categories: Methodology Blogs

    Paired t-test in R Exercises

    Wed, 2016-09-21 12:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

The paired samples t-test is used to check whether the mean of the same sample differs between two time points. For example, a medical researcher collects data on the same patients before and after a therapy. A paired t-test will show whether the therapy improves patient outcomes.

There are several assumptions that need to be satisfied for the results of a paired t-test to be valid. They are listed below:

    • The measured variable is continuous
    • The differences between the two groups are approximately normally distributed
    • We should not have any outliers in our data
    • An adequate sample size is required

    For this exercise we will use the anorexia data set available in package MASS. The data set contains weights of girls before and after anorexia treatment. Our interest is to know if the treatment caused any change in weight.
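As a quick orientation (a minimal sketch only, assuming the anorexia data frame's Prewt and Postwt columns; it is not the solution set linked below), the basic workflow looks roughly like this:

library(MASS)   # provides the anorexia data set
data(anorexia)
str(anorexia)   # Treat, Prewt (weight before) and Postwt (weight after)

# Paired t-test on the before/after weights
t.test(anorexia$Postwt, anorexia$Prewt, paired = TRUE)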

    Solutions to these exercises can be found here

    Exercise 1

    Load the data and inspect its structure

    Exercise 2

    Generate descriptive statistics on weight before treatment

    Exercise 3

    Generate descriptive statistics on weight after treatment

    Exercise 4

    Create a new variable that contains the differences in weight before and after treatment

    Exercise 5

    Create a boxplot to identify any outliers

    Exercise 6

    Create a histogram with a normal curve to visually inspect normality

    Exercise 7

    Perform a normality test on the differences

    Exercise 8

    Perform a power analysis to assess sample adequacy

    Exercise 9

    Perform a paired t test

    Exercise 10

    Interpret the results

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Categories: Methodology Blogs

    Welcome to the Tidyverse

    Wed, 2016-09-21 11:22

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Hadley Wickham, co-author (with Garrett Grolemund) of R for Data Science and RStudio's Chief Scientist, has focused much of his R package development on the un-sexy but critically important part of the data science process: data management. In the Tidy Tools Manifesto, he proposes four basic principles for any computer interface for handling data:

    1. Reuse existing data structures.

    2. Compose simple functions with the pipe.

    3. Embrace functional programming.

    4. Design for humans.

Those principles are realized in a new collection of his R packages: the tidyverse. Now, with a simple call to library(tidyverse) (after installing the package from CRAN), you can load into your R session a suite of tools that make managing data easier:

    • readr, for importing data from files
    • tibble, a modern iteration on data frames
    • tidyr, functions to rearrange data for analysis
    • dplyr, functions to filter, arrange, subset, modify and aggregate data frames

    The tidyverse also loads purrr, for functional programming with data, and ggplot2, for data visualization using the grammar of graphics.

    Installing the tidyverse package also installs for you (but doesn't automatically load) a raft of other packages to help you work with dates/time, strings, factors (with the new forcats package), and statistical models. It also provides various packages for connecting to remote data sources and data file formats.

    Simply put, tidyverse puts a complete suite of modern data-handling tools into your R session, and provides an essential toolbox for any data scientist using R. (Also, it's a lot easier to simply add library(tidyverse) to the top of your script rather than the dozen or so library(…) calls previously required!) Hadley regularly updates these packages, and you can easily update them in your R installation using the provided tidyverse_update() function.
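As a minimal sketch of that workflow (using only the functions named above):

install.packages("tidyverse")   # one-time installation from CRAN
library(tidyverse)              # loads readr, tibble, tidyr, dplyr, purrr, ggplot2, ...

# later, update the whole suite of packages in one go
tidyverse_update()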

    For more on tidyverse, check out Hadley's post on the RStudio blog, linked below.

    RStudio Blog: tidyverse 1.0.0

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    A Fun Gastronomical Dataset: What’s on the Menu?

    Tue, 2016-09-20 18:00

    (This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)


    I just found a fun food themed dataset that I’d never heard about and that I thought I’d share. It’s from a project called What’s on the menu where the New York Public Library has crowdsourced a digitization of their collection of historical restaurant menus. The collection stretches all the way back to the 19th century and well into the 1990’s, and on the home page it is stated that there are “1,332,271 dishes transcribed from 17,545 menus”. Here is one of those menus, from a turn of the (old) century Chinese-American restaurant:

The data is freely available in csv format (yay!) and here I'll just show how to get the data into R and use it to plot the popularity of some foods over time.

    First we’re going to download the data, “unzip” csv files into a temporary directory, and read them into R.

library(tidyverse)
library(stringr)
library(curl)

# This url changes every month, check what's the latest at http://menus.nypl.org/data
menu_data_url <- "https://s3.amazonaws.com/menusdata.nypl.org/gzips/2016_09_16_07_00_30_data.tgz"
temp_dir <- tempdir()
curl_download(menu_data_url, file.path(temp_dir, "menu_data.tgz"))
untar(file.path(temp_dir, "menu_data.tgz"), exdir = temp_dir)

dish <- read_csv(file.path(temp_dir, "Dish.csv"))
menu <- read_csv(file.path(temp_dir, "Menu.csv"))
menu_item <- read_csv(file.path(temp_dir, "MenuItem.csv"))
menu_page <- read_csv(file.path(temp_dir, "MenuPage.csv"))

    The resulting tables together describe the contents of the menus, but in order to know which dish was on which menu we need to join together the four tables. While doing this we’re also going to remove some uninteresting columns and remove some records that were not coded correctly.

d <- menu_item %>%
  select(id, menu_page_id, dish_id, price) %>%
  left_join(dish %>% select(id, name) %>% rename(dish_name = name),
            by = c("dish_id" = "id")) %>%
  left_join(menu_page %>% select(id, menu_id), by = c("menu_page_id" = "id")) %>%
  left_join(menu %>% select(id, date, place, location), by = c("menu_id" = "id")) %>%
  mutate(year = lubridate::year(date)) %>%
  filter(!is.na(year)) %>%
  filter(year > 1800 & year <= 2016) %>%
  select(year, location, menu_id, dish_name, price, place)

    What we are left with in the d data frame is a table of what dishes were served, where they were served and when. Here is a sampler:

    d[sample(1:nrow(d), 10), ]

# A tibble: 10 × 6
    year                      location menu_id                         dish_name price
   <dbl>                         <chr>   <int>                             <chr> <dbl>
1   1900            Fifth Avenue Hotel   25394            Broiled Mutton Kidneys    NA
2   1971                 Tadlich Grill   26670                       Mixed Green  0.85
3   1939                Maison Prunier   30325                  Entrecote Minute    NA
4   1914          The Beekman Café Co.   33898                  Camembert cheese  0.10
5   1900         Carlton Hotel Company   21865                        Pork Chops  0.15
6   1914 Gutmann's Café and Restaurant   33982 Cold Boiled Ham with Potato Salad  0.40
7   1912               Waldorf-Astoria   34512            Stuffed Figs and Dates  0.30
8   1933                   Hotel Astor   31262              Assorted Small Cakes  0.25
9   1933              Ambassador Grill   31291                    Stuffed celery  0.55
10  1901            Del Coronado Hotel   14512                           peaches    NA
# ... with 1 more variables: place <chr>

    Personally I’d go for the Stuffed Figs and Dates at the Waldorf-Astoria followed by some Assorted Small Cakes 21 years later at the Astor. If you want to download this slightly processed version of the dataset it’s available here in csv format. We can also see which are the most common menu items in the dataset:

    d %>% count(tolower(dish_name)) %>% arrange(desc(n)) %>% head(10)

# A tibble: 10 × 2
   `tolower(dish_name)`     n
                  <chr> <int>
1                coffee  8532
2                celery  4865
3                olives  4737
4                   tea  4682
5              radishes  3426
6       mashed potatoes  2999
7       boiled potatoes  2502
8     vanilla ice cream  2379
9         chicken salad  2306
10                 milk  2218

    That coffee is king isn’t that surprising, but the popularity of celery seems weird. My current hypothesis is that “celery” often refers to some kind of celery salad, or maybe it was common as a snack in the New York area in the 1900s. It should be remembered that the dataset does not represent what people ate in general, but is based on what menus were collected by the New York public library (presumably from the New York area). Also the bulk of the menus are from between 1900 and 1980:

ggplot(d, aes(year)) +
  geom_histogram(binwidth = 5, center = 1902.5, color = "black", fill = "lightblue") +
  scale_y_continuous("N.o. menu items")

    Even though it’s not completely clear what the dataset represents we could still have a look at some food trends over time. Below I’m going to go through a couple of common foodstuffs and, for each decennium, calculate what proportion of menus includes that foodstuff.

d$decennium = floor(d$year / 10) * 10
foods <- c("coffee", "tea", "pancake", "ice cream", "french frie",
           "french peas", "apple", "banana", "strawberry")
# Above I dropped the "d" in French fries in order
# to also match "French fried potatoes."
food_over_time <- map_df(foods, function(food) {
  d %>%
    filter(year >= 1900 & year <= 1980) %>%
    group_by(decennium, menu_id) %>%
    summarise(contains_food = any(str_detect(dish_name, regex(food, ignore_case = TRUE)), na.rm = TRUE)) %>%
    summarise(prop_food = mean(contains_food, na.rm = TRUE)) %>%
    mutate(food = food)
})

    First up, Coffee vs. Tea:

# A reusable list of ggplot2 directives to produce a lineplot
food_time_plot <- list(
  geom_line(),
  geom_point(),
  scale_y_continuous("% of menus include", labels = scales::percent, limits = c(0, NA)),
  scale_x_continuous(""),
  facet_wrap(~ food),
  theme_minimal(),
  theme(legend.position = "none"))

food_over_time %>%
  filter(food %in% c("coffee", "tea")) %>%
  ggplot(aes(decennium, prop_food, color = food)) + food_time_plot

    Both pretty popular menu items, but I’m not sure what to make of the trends… Next up Ice cream vs. Pancakes:

    food_over_time %>% filter(food %in% c("pancake", "ice cream")) %>% ggplot(aes(decennium, prop_food, color = food)) + food_time_plot

    Ice cream wins, but again I’m not sure what to make of how ice cream varies over time. Maybe it’s just an artifact of how the data was collected or maybe it actually reflects the icegeist somehow. What about French fries vs. French peas:

    food_over_time %>% filter(food %in% c("french frie", "french peas")) %>% ggplot(aes(decennium, prop_food, color = food)) + food_time_plot

Seems like the heyday of French peas is over, but French fries also seemed to peak in the 40s… Finally let's look at some fruit:

    food_over_time %>% filter(food %in% c("apple", "banana", "strawberry")) %>% ggplot(aes(decennium, prop_food, color = food)) + food_time_plot

    Banana has really dropped in menu popularity since the early 1900s…

    Anyway, this is a really cool dataset and I barely scratched the surface of what could be done with it. If you decide to explore this dataset further, and you make some plots and/or analyses, do send me a link and I will link to it here.

    To finish off let’s look at this elegant cocktail menu from 1937 which, among cocktails and fizzes, advertises tiny cocktail tamales:

To leave a comment for the author, please follow the link and comment on their blog: Publishable Stuff.

    Categories: Methodology Blogs

    One year of R / Notes

    Tue, 2016-09-20 18:00

    (This article was first published on R / Notes, and kindly contributed to R-bloggers)

    My collection of R notes is now slightly over one year old. This note reflects on how useful the exercise of blogging about R has been so far, and answers some of the questions that I have received about it.

    Blogging about R

    I created my collection of R notes with the intention to keep track of technical notes that I often need to refer to when I work with other people.

    My notes are usually very simple, at least at the mathematical level: my math skills are “read-only” skills, and I have nothing of interest to “showcase” in that area. Still, most of my R notes are more technical in nature than the kind of blog posts that I write for my academic blog, which I write in French for an audience of social scientists.

    Writing up “R / Notes” has forced me to simplify the code that I use. My impression is that, when I write code for others to read through, I like to streamline the code as much as possible, using pipes and as few R packages and lines of code as possible.

I also have the habit of using single-letter names for objects and of creating as few of these objects as possible, but that is probably as much of a bad habit as a good one. I can trace that habit to many years ago, when I used to write TI-BASIC code in high school…

    R-Bloggers syndication

    As the header of this collection suggests, my R notes are syndicated on the R-Bloggers aggregator, where most other existing R blogs are also syndicated.

    Thanks to Tal Galili for maintaining R-Bloggers, and for his help with syndicating this blog despite the fact that it is built on a customized version of the (discontinued?) Dropplets static blog engine that produces slightly weird Atom and RSS feeds.

    Code embeds

    One question that has come up more than once about this collection of notes is: How do I embed the R code that shows up in the notes?

    The answer is that I use Gist, with a short bit of custom CSS to hide everything produced by its embed method, except for the code and its line numbers:

    The little bit of CSS code above corresponds to this (secret) Gist. The code to embed the Gist, which requires JavaScript, is shown at the top of the Gist.

    Math embeds

    Another question that has come up about this collection of notes is: How do I embed the math code that (occasionally) shows up in the notes?

    In order to be able to use mathematical notation in some of my notes, I have turned, like many others, to the fantastic MathJax library, which brings the power of LaTeX typesetting to the Web.

To leave a comment for the author, please follow the link and comment on their blog: R / Notes.

    Categories: Methodology Blogs

    Linux Data Science Virtual Machine: new and upgraded tools

    Tue, 2016-09-20 16:18

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The Linux edition of the Data Science Virtual Machine on Microsoft Azure was recently upgraded. The Linux DSVM includes Microsoft R, Anaconda Python, Jupyter, CNTK and many other data science and machine learning tools, new or upgraded for this release. This eWeek story gives an overview of the improvements, but the highlights are:

    • Microsoft R Server (developer edition) is now included. This includes the complete R distribution from CRAN, plus additional data-analysis functions with big-data capabilities, and the DeployR framework for integrating R into applications as a web service. (The developer edition is identical to the enterprise Microsoft R Server edition, but licensed for development/test use.)
    • JupyterHub is now included, allowing multiple users to collaborate on Jupyter Notebooks (including R and/or Python code) simultaneously.
    • A new data science language, Julia, is now included. You can program in Julia from the command line or from a Jupyter notebook. 

    The Windows edition of the Data Science VM has also been updated, and now includes SQL Server 2016 (developer edition) with R Services for in-database R processing.

    Both editions of the Data Science VM are available on Microsoft Azure in a variety of configurations of RAM, cores, and disk. There are no software costs; you pay only the hourly Azure infrastructure charge to use it. For more details on the improvements to the Data Science Virtual Machine, follow the link to the blog post below.

    Cortana Intelligence and Machine Learning Blog: Recent Updates to the Microsoft Data Science Virtual Machine 

     

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    10 new jobs for R users – from around the world (2016-09-20)

    Tue, 2016-09-20 16:15

     

    To post your R job on the next post

    Just visit this link and post a new R job to the R community. You can either post a job for free (which works great), or pay $50 to have your job featured (and get extra exposure).

    Current R jobs

    Job seekers: please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs
    More New Jobs
    1. Full-Time
      Help with finding similarities between XY scatter plots and clustering of time series RNA Seq. data needed
      UALR – Posted by tfh4
      Anywhere
      20 Sep 2016
    2. Full-Time
      R programmer and bioinformatician needed to help an almost blind bioinformatics graduate student
      UALR – Posted by tfh4
      Anywhere
      20 Sep 2016
    3. Full-Time
      Data Manager II, CRP
      Boston Children’s Hospital – Posted by katiekerr7
      Boston
      Massachusetts, United States
      14 Sep 2016
    4. Full-Time
      Research and Analytics Associate
      Hodges Ward Elliott – Posted by tkiely@hwerealestate.com
      New York
      New York, United States
      13 Sep 2016
    5. Freelance
      Data Scientist for StartupMetrics @ New York
      StartupMetrics – Posted by hnisha.patel
      New York
      New York, United States
      10 Sep 2016
    6. Full-Time
      Data Mining Analyst Fellow – fixed term 1 year
      Boston Scientific – Posted by glynburtt
      Hemel Hempstead
      England, United Kingdom
      5 Sep 2016
    7. Full-Time
      Statistician
      USC Center for Applied Molecular Medicine – Posted by Naim Matasci
      Los Angeles
      California, United States
      2 Sep 2016
    8. Full-Time
      Data Modeller & Analyst @ Manchester, UK
      Hello Soda – Posted by LeanneFitzpatrick
      Manchester
      England, United Kingdom
      31 Aug 2016
    9. Full-Time
      Lead Data Scientist for Lumosity @ San Francisco, California, U.S.
      Lumos Labs – Posted by lumoslabs
      San Francisco
      California, United States
      30 Aug 2016
    10. Full-Time
      Environmental Specialist, Data Analyst @ Saint Petersburg, Florida, United States
      Florida Fish and Wildlife Conservation Commission, Fish and Wildlife Research Institute – Posted by rhardy
      Saint Petersburg
      Florida, United States
      26 Aug 2016

    In R-users.com you can see all the R jobs that are currently available.

    R-users Resumes

    R-users also has a resume section which features CVs from over 200 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

    (you may also look at previous R jobs posts).

    Categories: Methodology Blogs

    R for Beginners and other R courses | Milan

    Tue, 2016-09-20 12:44

    (This article was first published on MilanoR, and kindly contributed to R-bloggers)

    R for Beginners is Quantide first course of the Fall term. It will take place in Legnano (Milan) on October 3 and 4.

    This course is intended for beginners: no previous R knowledge is needed.

    Outline
    • Basics of R: installation, resources, packages, objects
    • Data import and export
    • Data manipulation using dplyr
    • Plotting and graphics using ggplot
    • Basics of data modelling

The course is open to a maximum of 6 attendees. FAQ, detailed program and tickets

    R for Beginners is organized by the R training and consulting company Quantide and it is taught in Italian. The course materials are in English.

    If you want to know more about Quantide, check out Quantide’s website.

    Location

    Legnano is about 30 min by train from Milano. Trains from Milano to Legnano are scheduled every 30 minutes, and Quantide premises are 3 walking minutes from Legnano train station.

    Price 

    Euro 400 + VAT

    The cost includes lunch, comprehensive course materials + 1 hour of individual online post course support for each student within 30 days from course date. We offer an academic discount for those engaged in full time studies or research.

     

    Other R courses | Autumn term

    October 17-18: Efficient Data Manipulation with R. Handle every kind of Data Management task, using the most modern R tools: tidyr, dplyr and lubridate. Even with backend databases.  Reserve now

    October 25-26: Statistical Models with R. Develop a wide variety of statistical models with R, from the simplest Linear Regression to the most sophisticated GLM models. Reserve now

    November 7-8: Data Mining with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now

November 15-16: R for Developers. Move forward from being an R user to becoming an R developer. Discover the R working mechanisms and master your R programming skills. Reserve now

    For further information contact us at training@quantide.com

    The post R for Beginners and other R courses | Milan appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

    Categories: Methodology Blogs

    Relative error distributions, without the heavy tail theatrics

    Mon, 2016-09-19 20:18

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic idea you bring to analysis, it is a likely fact about the world that comes at you). The article is a good example of how to plot and reason about such situations.

    I am just going to add a few additional references (mostly from Nina) and some more discussion on log-normal distributions versus Zipf-style distributions or Pareto distributions.

    The theory

In analytics, data science, and statistics we often assume we are dealing with nice or tightly concentrated distributions such as the normal or Gaussian distribution. Analysis tends to be very easy in these situations and does not require much data. However, for many quantities of interest (wealth, company sizes, sales, and many more) it becomes obvious that we cannot be dealing with such a distribution. The telltale sign is usually when relative error is more plausible than absolute error. For example it is much more plausible that we know our net worth to within plus or minus 10% than to within plus or minus $10.

    In such cases you have to deal with the consequences of slightly more wild distributions such as at least the log-normal distribution. In fact this is the important point and I suggest you read Nina’s article for motivation, explanation, and methods. We have found this article useful both in working with data scientists and in working with executives and other business decision makers. The article formalizes ideas all of these people already “get” or anticipate into concrete examples.

    In addition to trying to use mathematics to make things more clear, there is a mystic sub-population of mathematicians that try to use mathematics to make things more esoteric. They are literally disappointed when things make sense. For this population it isn’t enough to see if switching from a normal to log-normal distribution will fix the issues in their analysis. They want to move on to even more exotic distributions such as Pareto (which has even more consequences) with or without any evidence of such a need.

The issue is: in a log-normal distribution we see rare large events much more often than in a standard normal distribution. Modeling this can be crucial as it tells us not to be lulled into too strong a sense of security by small samples. This concern can be axiomatized into "heavy tailed" or "fat tailed" distributions, but be aware: these distributions tend to be more extreme than what is implied by a relative error model. The usual heavy tail examples are Zipf-style distributions or Pareto distributions (people tend to ignore the truly nasty example, the Cauchy distribution, possibly because it dates back to the 17th century and thus doesn't seem hip).

The hope seems to be that one is saving the day by bringing in new esoteric or exotic knowledge such as fractal dimension or Zipf's law. The actual fact is this sort of power-law structure has been known for a very long time under many names. Here are some more references:

    Reading these we see that the relevant statistical issues have been well known since at least the 1920’s (so were not a new discovery by the later loud and famous popularizers). The usual claim of old wine in new bottles is that there is some small detail (and mathematics is a detailed field) that is now set differently. To this I put forward a quote from Banach (from Adventures of a Mathematician S.M. Ulam, University of California Press, 1991, page 203):

    Good mathematicians see analogies between theorems or theories, the very best ones see analogies between analogies.

    Drowning in removable differences and distinctions is the world of the tyro, not the master.

    From Piantadosi we have:

    The apparent simplicity of the distribution is an artifact of how the distribution is plotted. The standard method for visualizing the word frequency distribution is to count how often each word occurs in a corpus, and sort the word frequency counts by decreasing magnitude. The frequency f(r) of the r’th most frequent word is then plotted against the frequency rank r, yielding typically a mostly linear curve on a log-log plot (Zipf, 1936), corresponding to roughly a power law distribution. This approach— though essentially universal since Zipf—commits a serious error of data visualization. In estimating the frequency-rank relationship this way, the frequency f(r) and frequency rank r of a word are estimated on the same corpus, leading to correlated errors between the x-location r and y-location f(r) of points in the plot.

    An Example

    Let us work through this one detailed criticism using R (all synthetic data/graphs found here). We start with the problem and a couple of observations.

    Suppose we are running a business and organize our sales data as follows. We compute what fraction of our sales each item is (be it a count, or be it in dollars) and then rank them (item 1 is top selling, item 2 is next, and so on).

    The insight of the Pareto-ists and Zipfians is if we plot sales intensity (probability or frequency) as a function of sales rank we are in fact very likely to get a graph that looks like the following:

Instead of all items selling at the same rate we see the top selling item can often make up a significant fraction of the sales (such as 20%). There are a lot of 80/20 rules based on this empirical observation.

    Notice also the graph is fairly illegible, the curve hugs the axes and most of the visual space is wasted. The next suggestion is to plot on “log-log paper” or plot the logarithm of frequency as a function of logarithm of rank. That gives us a graph that looks like the following:

    If the original data is Zipfian distributed (as it is in the artificial example) the graph becomes a very legible straight line. The slope of the line is the important feature of the distribution and is (in a very loose sense) the “fractal dimension” of this data. The mystics think that by identifying the slope you have identified some key esoteric fact about the data and can then somehow “make hay” with this knowledge (though they never go on to explain how).

Chris Anderson in his writings on the "long tail" (including his book) clearly described a very practical use of such graphs. Instead of assuming the line on log-log plots is a consequence of something special, suppose it is a consequence of something mundane. Maybe graphs tend to look like this for catalogs, sales, wealth, company sizes, and so on. So instead of saying the perfect fit is telling us something, look at defects in fit. Perhaps they indicate something. For example: suppose we are selling products online and something is wrong with a great part of our online catalogue. Perhaps many of the products don't have pictures, don't have good descriptions, or some other common defect. We might expect our rank/frequency graph to look more like the following:

What happened is after product 20 something went wrong. In this case (because the problem happened early at an important low rank) we can see it, but it is even more legible on the log-log plot.

    The business advice is: look for that jump, sample items above and below the jump, and look for a difference. As we said the difference could be no images on such items, no free shipping, or some other sensible business impediment. The reason we care is this large population of low-volume items could represent a non-negligible fraction of sales. Below is the theoretical graph if we fixed whatever is wrong with the rarer items and plotted sales:

    From this graph we can calculate that the missing sales represent a loss of about 32% of revenue. If we could service these sales cheaply we would want them.

    The flaw in analysis

    In the above I used a theoretical Zipfian world to generate my example. But suppose the world isn’t Zipfian (there are many situations where log-normal is a much more plausible situation). Just because the analyst wishes things were exotic (requiring their unique heroic contribution) doesn’t mean they are in fact exotic. Log-log paper is legible because it reprocesses the data fairly violently. As Piantadosi said: we may see patterns in such plots that are features of the analysis technique, and not features of the world.

Suppose the underlying sales data is log-normal distributed instead of Zipfian distributed (a plausible assumption until eliminated). If we had full knowledge of every possible sale for all time we could make a log-log plot over all time and get the following graph.

    What we want to point out is: this is not a line. The hook down at the right side means that rare items have far fewer sales than a Zipfian model would imply. It isn’t just a bit of noise to be ignored. This means when one assumes a Zipfian model one is assuming the rare items as a group are in fact very important. This may be true or may be false, which is why you want to measure such a property and not assume it one way or the other.

The above graph doesn’t look so bad. The honest empiricist may catch the defect and say it doesn’t look like a line (though obviously a quantitative test of distributions would also be called for). But this graph was plotting all sales over all time. We would never see that. Statistically we usually model observed sales as a sample drawn from this larger ideal sampling population. Let’s take a look at what that graph may look like. An example is given below.

    I’ll confess, I’d have a hard time arguing this wasn’t a line. It may or may not be a line, but it is certainly not strong evidence of a non-line. This data did not come from a Zipfian distribution (I know I drew it from a log-normal distribution), yet I would have a hard time convincing a Zipfian that it wasn’t from a Zipfian source.
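The synthetic data and plotting code behind these figures are not reproduced in this extract. A minimal sketch of the same idea (the sample sizes and distribution parameters here are my own choices, not the author's) is:

library(ggplot2)

set.seed(1)
# True item popularities drawn from a log-normal distribution (not Zipfian).
true_intensity <- rlnorm(500, meanlog = 0, sdlog = 2)

# A finite sample of 1000 observed "sales", drawn according to those popularities.
observed <- sample(seq_along(true_intensity), size = 1000,
                   replace = TRUE, prob = true_intensity)

# Estimate frequency and rank from the same sample (the step Piantadosi criticizes),
# then plot on log-log axes.
freq <- sort(table(observed), decreasing = TRUE) / length(observed)
rank_freq <- data.frame(rank = seq_along(freq), frequency = as.numeric(freq))

ggplot(rank_freq, aes(x = rank, y = frequency)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()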

    And this brings us back to Piantadosi’s point. We used the same sample to estimate both sales frequencies and sales ranks. Neither of those are actually known to us (we can only estimate them from samples). And when we use the same sample to estimate both, they necessarily come out very related due to the sampling procedure. Some of the biases seem harmless such as frequency monotone decreasing in rank (which is true for unknown true values). But remember: relations that are true in the full population are not always true in the sample. Suppose we had a peek at the answers and instead of estimating the ranks took them from the theoretical source. In this case we could plot true rank versus estimated frequency:

This graph is much less orderly because we have eliminated some of the plotting bias which was introducing its own order. There are still analysis artifacts visible, but that is better than hidden artifacts. For example the horizontal strips are items that occurred with the same frequency in our sample, but had different theoretical ranks. In fact our sample is of size 1000, so the rarest frequency we can measure is 1/1000, which creates the lowest horizontal stripe. The neatness of the previous graph was due to dots standing on top of each other as we estimated frequency as a function of rank.

We are not advocating specific changes, we are just saying the log-log plot is a fairly refined view, and as such many of its features are details of processing, not all correctly inferred or estimated features of the world. Again, for a more useful applied view we suggest Nina Zumel’s “Living in A Lognormal World.”

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

    Categories: Methodology Blogs

    YaRrr! The Pirate’s (video) Guide to R

    Mon, 2016-09-19 15:49

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

Today is Talk Like A Pirate Day, the perfect day to learn R, the programming language of pirates (arrr, matey!). If you have two-and-a-bit hours to spare, Nathaniel Phillips has created a video tutorial, YaRrr! The Pirate's Guide to R, which will take you through the basics: installation, basic R operations, and the matrix and data frame objects.

     

    For a more in-depth study of R, there's also a 250-page e-book YaRrr! The Pirate’s Guide to R which goes into the basics in more depth, and covers more advanced topics including data visualization, statistical analysis, and writing your own functions.

    There's also an accompanying package to the video and book called (appropriately) yarr that includes datasets from the course and also an interesting "Pirate Plot" data visualization that combines raw data, summary statistics, a "bean plot" distribution, and a confidence interval. 
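As a quick taste (a minimal sketch; the published package name is yarrr, and the call below uses a built-in R dataset rather than anything from the course materials):

library(yarrr)

# A "pirate plot" of chick weights by diet: raw points, means,
# density ("bean") outlines and inference bands in one display.
pirateplot(formula = weight ~ Diet, data = ChickWeight)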

    For more on The Pirate's Guide to R (and to tip him a beer), follow the link to Nathaniel's blog below.

    Nathaniel Phillips: YaRrr! The Pirate’s Guide to R

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    An R Function for Generating Authenticated URLs to Private Web Sites Hosted on AWS S3

    Mon, 2016-09-19 12:18

    (This article was first published on R – When Localhost Isn't Enough, and kindly contributed to R-bloggers)

    Quite often I want to share simple (static) web pages with other colleagues or clients. For example, I may have written a report using R Markdown and rendered it to HTML. AWS S3 can easily host such a simple web page (e.g. see here), but it cannot, however, offer any authentication to prevent anyone from accessing potentially sensitive information.

    Yegor Bugayenko has created an external service S3Auth.com that stands in the way of any S3 hosted web site, but this is a little too much for my needs. All I want to achieve is to limit access to specific S3 resources that will be largely transient in nature. A viable and simple solution is to use ‘query string request authentication’ that is described in detail here. I must confess to not really understanding what was going on here, until I had dug around on the web to see what others have been up to.

    This blog post describes a simple R function for generating authenticated and ephemeral URLs to private S3 resources (including web pages) that only the holders of the URL can access.

    Creating User Credentials for Read-Only Access to S3

    Before we can authenticate anyone, we need someone to authenticate. From the AWS Management Console create a new user, download their security credentials and then attach the AmazonS3ReadOnlyAccess policy to them. For more details on how to do this, refer to a previous post. Note, that you should not create passwords for them to access the AWS console.

    Loading a Static Web Page to AWS S3

    Do not be tempted to follow the S3 ‘Getting Started’ page on how to host a static web page and in doing so enable ‘Static Website Hosting’. We need our resources to remain private and we would also like to use HTTPS, which this option does not support. Instead, create a new bucket and upload a simple HTML file as usual. An example html file – e.g. index.html – could be,

<!DOCTYPE html>
<html>
<body>
<p>Hello, World!</p>
</body>
</html>

An R Function for Generating Authenticated URLs

    We can now use our new user’s Access Key ID and Secret Access Key to create a URL with a limited lifetime that enables access to index.html. Technically, we are making a HTTP GET request to the S3 REST API, with the authentication details sent as part of a query string. Creating this URL is a bit tricky – I have adapted the Python example (number 3) that is provided here, as an R function (that can be found in the Gist below) – aws_query_string_auth_url(...). Here’s an example showing this R function in action:

    path_to_file <- "index.html" bucket <- "my.s3.bucket" region <- "eu-west-1" aws_access_key_id <- "DWAAAAJL4KIEWJCV3R36" aws_secret_access_key <- "jH1pEfnQtKj6VZJOFDy+t253OZJWZLEo9gaEoFAY" lifetime_minutes <- 1 aws_query_string_auth_url(path_to_file, bucket, region, aws_access_key_id, aws_secret_access_key, lifetime_minutes) # "https://s3-eu-west-1.amazonaws.com/my.s3.bucket/index.html?AWSAccessKeyId=DWAAAKIAJL4EWJCV3R36&Expires=1471994487&Signature=inZlnNHHswKmcPfTBiKhziRSwT4%3D"

    And here’s the code for it as inspired by the short code snippet here:
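The function itself lives in the author's Gist, which is not embedded in this extract. Below is a minimal sketch of what such a function could look like, assuming AWS signature version 2 query-string authentication; it illustrates the idea rather than reproducing the author's original code:

library(digest)     # hmac()
library(base64enc)  # base64encode()

aws_query_string_auth_url <- function(path_to_file, bucket, region,
                                      aws_access_key_id, aws_secret_access_key,
                                      lifetime_minutes) {
  # Expiry time as seconds since the epoch
  expiry <- as.integer(Sys.time()) + lifetime_minutes * 60

  # Canonical string that AWS expects to be signed for a plain GET request
  string_to_sign <- paste0("GET\n\n\n", expiry, "\n/", bucket, "/", path_to_file)

  # HMAC-SHA1 with the secret key, then base64- and URL-encode the result
  signature_raw <- hmac(aws_secret_access_key, string_to_sign, algo = "sha1", raw = TRUE)
  signature <- URLencode(base64encode(signature_raw), reserved = TRUE)

  paste0("https://s3-", region, ".amazonaws.com/", bucket, "/", path_to_file,
         "?AWSAccessKeyId=", aws_access_key_id,
         "&Expires=", expiry,
         "&Signature=", signature)
}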

    Note the dependencies on the digest and base64enc packages.

To leave a comment for the author, please follow the link and comment on their blog: R – When Localhost Isn't Enough.

    Categories: Methodology Blogs

    Applying Functions To Lists Exercises

    Mon, 2016-09-19 12:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    The lapply() function applies a function to individual values of a list, and is a faster alternative to writing loops.

    Structure of the lapply() function:
    lapply(LIST, FUNCTION, ...)

    The list variable used for these exercises:
    list1 <- list(observationA = c(1:5, 7:3), observationB=matrix(1:6, nrow=2))

    Answers to the exercises are available here.
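As a quick illustration of the calling pattern (not one of the exercises), here is lapply() applying mean() to each element of list1:

lapply(list1, mean)
# $observationA
# [1] 4
#
# $observationB
# [1] 3.5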

    Exercise 1

    Using lapply(), find the length of list1‘s observations.

    Exercise 2

    Using lapply(), find the sums of list1‘s observations.

    Exercise 3

    Use lapply() to find the quantiles of list1.

    Exercise 4

    Find the classes of list1‘s sub-variables, with lapply().

    Exercise 5

    Required function:
    DerivativeFunction <- function(x) { log10(x) + 1 }

    Apply the “DerivativeFunction” to list1.

    Exercise 6

    Script the “DerivativeFunction” within lapply(). The dataset is list1.

    Exercise 7

    Find the unique values in list1.

    Exercise 8

    Find the range of list1.

    Exercise 9

    Print list1 with the lapply() function.

    Exercise 10

    Convert the output of Exercise 9 to a vector, using the unlist(), and lapply(), functions.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Categories: Methodology Blogs

    ggtree for outbreak data

    Mon, 2016-09-19 08:10

    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    OutbreakTools implements basic tools for the analysis of Disease Outbreaks.

It defines the S4 class obkData to store case-based outbreak data. It also provides a function, plotggphy, to visualize such data on a phylogenetic tree.

library(OutbreakTools)
data(FluH1N1pdm2009)
attach(FluH1N1pdm2009)

x <- new("obkData",
         individuals = individuals,
         dna = FluH1N1pdm2009$dna,
         dna.individualID = samples$individualID,
         dna.date = samples$date,
         trees = FluH1N1pdm2009$trees)

plotggphy(x, ladderize = TRUE, branch.unit = "year",
          tip.color = "location", tip.size = 3, tip.alpha = 0.75)

As I mentioned in the post ggtree for microbiome data, ggtree fits the R ecosystem in phylogenetic analysis. It serves as a general tool for annotating phylogenetic trees with different associated data from various sources. The obkData object is also supported by ggtree, and outbreak data stored in the object can be used to annotate the tree using the grammar of graphics supported by ggtree.

library(ggtree)
ggtree(x, mrsd = "2009-09-30", as.Date = TRUE) +
  geom_tippoint(aes(color = location), size = 3, alpha = .75) +
  scale_color_brewer("location", palette = "Spectral") +
  theme_tree2(legend.position = 'right')

We can also associate the tree with other types of data that may come from experiments or evolutionary inference, and use them to annotate the tree as demonstrated in the online vignettes.

    Citation

    G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution. doi:10.1111/2041-210X.12628.

To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

    Categories: Methodology Blogs

    Learning Statistics on Youtube

    Mon, 2016-09-19 06:41

    (This article was first published on Statistics, R & etc. - Flavio Azevedo, and kindly contributed to R-bloggers)

Youtube.com is the second most accessed website in the world (surpassed only by its parent, google.com). It has a whopping 1 billion unique views a month. [1, 2] It is a force to be reckoned with. On the video sharing platform, there are many brilliant and hard-working content creators producing high-quality, free educational videos that students and academics alike can enjoy. I surveyed the Youtube content that could be useful for those interested in learning statistics, and I have listed and categorized it below.

Truth be told, this post is a glorified Google search in many respects. In any case, I had long intended to gather this information, to facilitate the often laborious task of finding pertinent resources for learning statistical science in a non-static format (i.e., videos) that is easily accessible, high-quality, instructive and free.

Another motivation had to do with my teaching obligations. This fall, I will teach a graduate course in Stats with R. To this end, I considered becoming a content creator myself, so as to allow students to access the course's content from the convenience of their homes. In this process, I found some excellent statistical courses on Youtube. Some were really useful in terms of their organization, others in terms of content, interesting explanations, pedagogical skills, availability of materials, etc. Altogether, searching for resources was a very instructive experience, whose fruits should be shared.

Importantly, in this process, I learned that Youtube is not short of "introductory courses on ___." Not on Statistics, Probability or R, anyway. Which is a good thing. And often, you even see these three together. Also in abundance are courses on the ABCs of probability theory, classical statistics (i.e., up to ANOVA, ANCOVA), and the basics of applied statistics (e.g., Econometrics, Biostatistics, and Machine Learning). Indeed, Machine Learning (mostly through Data Science) is really well represented on Youtube.

Due to the sheer number of channels, I organized them into three broad categories: use of R as statistical software, use of other statistical software, and lecture format only. I also listed each channel's content/topic, whether the authors provide slides, code, or additional materials online (with links), and relevant remarks.

    1. Learning Statistics with R




     
    Youtube channel | Content | Software | Online materials? | Remarks
    Mike Marin | [Intro] Basic Stats in R | R | Yes, good materials | University of British Columbia
    Michael Butler | [Intro] to R and Stats, Modern | R | Yes | Good intro to R + Exercises
    EZLearn | [Intro] Basic Stats in R | R | Exercises w/ solutions | –
    Renegade Thinking: Courtney Brown | [Intro] Undergraduate Stats | R | Yes | Good Lectures
    Barton Poulson | [Intro] Classical Stats, Programming & Solved Exercises | R, Python, SPSS | Yes | Gives intro to Python, R, SPSS and launching an OLP
    Ed Boone | [Intro] Basic R and SAS | R & SAS | Yes | –
    Lynda.com | [Intro] Basics of R and Descriptives | R | Yes | OLP
    Bryan Craven | [Intro] Basic Stats in R | R | No | –
    Laura Suttle (Revolution) | [Intro] R tour for Beginners | R | No | –
    Phil Chan | [Intro] Classical and Bio-stats | R, SPSS, Eviews | No | –
    Gordon Anthony Davis | [Intro] R Programming Tutorial | R | No | Thorough intro for beginners
    David Langer | [Intro] Basics of R | R | No | Excellent pedagogical skills
    MrClean1796 | [Intro] Math, Physics and Statistics, lecture & R | R | No | –
    Brian Caffo | Advanced & Bio-Stats, Reproducible Research | R | Yes, Coursera and GitHub | Professor of Bio-statistics, Johns Hopkins Univ.
    James Scott | Advanced Stats | R | Yes, and GitHub | Several course materials on GitHub
    Derek Kane | Machine Learning | R | Yes | Excellent videos, Fourier Analysis, Time Series Forecasting
    DataCamp | Programming, DataViz, R Markdown [free] | R | Yes, paid ($9 for students) | –
    Maria Nattestad | DataViz | R | Personal website | Plotting in R for Biologists
    Christoph Scherber | Mixed, GLM, GLS, Contrasts | R | Yes | –
    Librarian Womack | Time Series, DataViz, BigData | R | Yes, course and .R materials online | –
    Jarad Niemi | R Workflow, Bayesian, Statistical Inference | R | Yes | –
    Justin Esarey | Bayesian, Categorical and Longitudinal Data, Machine Learning | R | Yes, lots and lots | Political scientist
    Jeromy Anglim | Research Methods | R | Blog: Psych & Stats, GitHub | Rmeetups and notes on Gelman, Carlin, Stern, and Rubin
    Erin Buchanan | Under- & post-graduate Stats, SEM | R, G*Power, Excel | Yes | Excellent pedagogical strategies
    Richard McElreath | From Basic to Advanced Bayesian Stats | R and Stan | Yes, lots | Book lectures
    edureka | Data Science | R, Hadoop, Python | Yes, online learning platform | R intro w/ Hadoop [free]
    Learn R | R programming, stats on website | R, Python | Yes, and One R Tip A Day | On website, lots of starter's code
    Data School | Machine Learning, Data Manipulation (dplyr) | Python, R | Yes, dplyr | –
    Econometrics Academy | Statistics (via Econometrics) | R, STATA, SPSS | Yes | OLP, excellent materials and resources
    Jalayer Academy | Basic Stats + Machine Learning | R, Excel | No | Also lectures
    Michael Levy | Authoring from R, Markdown, Shiny | R | No | –
    Melvin L. | Machine Learning, R Programming, PCA, DataViz | R, Python, Gephi | No | Interesting intro for Spark
    OpenIntroOrg | Intro to Stats/R plus Inference, Linear Models, Bayesian | R | Yes, Coursera and OpenIntro | Coursera courses, resources in SAS


     
    2. Learning Statistics with other software




     
    Youtube channel | Content | Software | Online materials? | Remarks
    Jonathan Tuke | Basic Stats | Matlab | No | –
    Saiful Yusoff | PLS, Intro to MaxQDA | SmartPLS, MaxQDA | Yes | BYU
    James Gaskin | SEM, PLS, Cluster | SPSS, AMOS, SmartPLS | Yes | BYU
    Quantitative Specialists | Basic Stats | SPSS | No | Upbeat videos
    RStatsInstitute | Basic Stats | SPSS | No | Instructor at Udemy
    how2stats | Basic Stats, lecture and software demonstrations | SPSS | Yes | Complete classical stats
    BrunelASK | Basic Stats | SPSS | – | –
    The Doctoral Journey | Basic Stats | SPSS | Yes | –
    StatisticsLectures | Basic Stats, lecture format | SPSS | Yes | Discontinued, but thorough basic stats
    Andy Field | Classical Stats, lecture and software demonstrations | SPSS | Yes, registration needed | Used heavily in the Social Sciences
    Quinnipiac University: Biostatistics | Classical Stats | SPSS | No | –
    The RMUoHP Biostatistics | Basic and Bio-Stats | SPSS, Excel | No | –
    PUB708 Team | Classical Statistics | SPSS, MiniTab | No | –
    Professor Ami Gates | Classical Stats | SPSS, Excel, StatCrunch | Yes | –
    H. Michael Crowson | Intro and Basic Stats in several software packages | SPSS, STATA, AMOS, LISREL | Yes | –
    Math Guy Zero | Classical Stats + SEM | SPSS, Excel, PLS | No | Lots of materials
    BayesianNetworks | Bayesian Statistics, SEM, Causality | BayesianLab | Yes | –
    Khan Academy | Programming 101 | Python | Yes | –
    Mike's SAS | Short intro to SAS, SPSS | SAS, SPSS | No | –
    Christian A. Wandeler | Basic Stats | PSPP | No | –



     
    3. Lectures on statistics




     
    Youtube channel | Content | Software | Online materials? | Remarks
    Stomp On Step 1 | [Intro] Bio-Stats, Basic | Lectures | Yes | USMLE
    Khan Academy | [Intro] Basic Stats, lecture format | Lectures | Yes | –
    Joseph Nystrom | [Intro] Basic Stats | Lectures | Yes | Active & unorthodox teaching
    Statistics Learning Centre | [Intro] Basic Stats | Lectures | Yes | Register to access materials
    Brandon Foltz | [Intro] Basic Stats | Lectures | Soon | Excellent visuals
    David Waldo | [Intro] Probability Theory | Lectures | No | –
    Andrew Jahn | [Intro] Basic Stats | Lectures | No | FSL, AFNI and SPM [Neuro-imaging]
    Professor Leonard | [Intro] Stats and Maths | Lectures | No | Excellent pedagogical skills
    ProfessorSerna | [Intro] Basic Stats | Lectures | No | –
    Victor Lavrenko | Machine Learning, Probabilistic, Cluster, PCA, Mixture Models | Lectures | Yes, very complete | Excellent content, and lots of it
    Jeremy Balka's Statistics | Graduate-level Classical Stats, Lecture | Lectures | Yes, very thorough | Excellent altogether, p-value vid great!
    Methods Manchester Uni | Discussion on a wide variety of methods, SEM | Lectures | Yes | Methods Fair
    Steve Grambow | Series on Inference | Lectures | Yes | Great lectures on inference [DUKE]
    Statistics Corner: Terry Shaneyfelt | Statistical Inference | Lectures | Yes | From a clinical perspective
    Michel van Biezen | Complete Course of Stats | Lectures | Yes (1, 2, 3) | Thorough and complete, plus Physics and Maths
    Oxford Education | Bayesian statistics: a comprehensive course | Lectures | Yes | –
    Nando de Freitas | Machine Learning | Lectures | Yes, also here and here | –
    Alex Smola | Machine Learning | Lectures | Yes, slides and code | –
    Abu (Abulhair) Saparov | Machine Learning | Lectures | Yes | Taught by Tom Mitchell and Maria-Florina Balcan
    Geoff Gordon | Machine Learning, Optimization | Lectures | Yes | –
    MIT OpenCourseWare | Probability Theory, Stochastic Processes | Lectures | Yes, here and here | –
    Alexander Ihler | Machine Learning | Lectures | Yes, along w/ many other classes | –
    Royal Statistical Society | Important statistical issues | Lectures | Yes | Interesting topics
    Ben Lambert | Graduate and Advanced Stats | Lectures | No | Asymptotic behaviour of estimators, SEM, EFA
    DeepLearning TV | Machine (and Deep) Learning | Lectures | No | Excellent pedagogical skills
    Mathematical Monk | Machine Learning, and Probability Theory | Lectures | No | –


     
    Final Remarks

    This collection of channels is not meant to be exhaustive. If I have neglected a Youtube channel that you think should figure in this list, please let me know via the contact form on the original post and I shall include it. Thank you very much!


    To leave a comment for the author, please follow the link and comment on their blog: Statistics, R & etc. - Flavio Azevedo. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Running a model on separate groups

    Mon, 2016-09-19 06:19

    (This article was first published on blogR, and kindly contributed to R-bloggers)

    Ever wanted to run a model on separate groups of data? Read on!

    Here’s an example of a regression model fitted to separate groups: predicting a car’s Miles per Gallon with various attributes, but separately for automatic and manual cars.

    library(tidyverse)
    library(broom)

    mtcars %>%
      nest(-am) %>%
      mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")),
             fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, augment)) %>%
      unnest(results) %>%
      ggplot(aes(x = mpg, y = .fitted)) +
        geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
        geom_point() +
        facet_grid(am ~ .) +
        labs(x = "Miles Per Gallon", y = "Predicted Value") +
        theme_bw()

     Getting Started

    A few things to do/keep in mind before getting started…

     A lot of detail for novices

    I started this post after working on a larger problem for which I couldn’t add detail about lower-level aspects. So this post is very detailed about a particular aspect of a larger problem and, thus, best suited for novice to intermediate R users.

     One of many approaches

    There are many ways to tackle this problem. We’ll cover a particular approach that I like, but be mindful that there are plenty of alternatives out there.

     The Tidyverse

    We’ll be using functions from many tidyverse packages like dplyr and ggplot2, as well as the tidy modelling package broom. If you’re unfamiliar with these and want to learn more, a good place to get started is Hadley Wickham’s R for Data Science. Let’s load these as follows (making use of the new tidyverse package):

    library(tidyverse)
    library(broom)

     mtcars

    Ah, mtcars. My favourite data set. We’re going to use this data set for most examples. Be sure to check it out if you’re unfamiliar with it! Run ?mtcars, or here’s a quick reminder:

    head(mtcars)
    #>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    #> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    #> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    #> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    #> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    #> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    #> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

    Let’s get to it.

     Nesting Tibbles

    Nested tibbles – sounds like some rare bird! For those who aren’t familiar with them, “tibbles are a modern take on data frames”. For our purposes here, you can think of a tibble like a data frame. It just prints to the console a little differently. Click the quote to learn more from the tibble vignette.
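    As a quick illustration (not from the original post), here is how the same data prints as a data frame versus as a tibble; as_tibble() comes from the tibble package, which loads with library(tidyverse):

    library(tidyverse)

    mtcars              # a data frame prints every row and gives no column type information
    as_tibble(mtcars)   # a tibble prints its dimensions, the column types, and only the first ten rows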

    So what do I mean by nested tibbles? Well, this is when we take sets of columns and rows from one data frame/tibble, and save (nest) them as cells in a new tibble. Make sense? No? Not to worry. An example will likely explain better.

    We do this with nest() from the tidyr package (which is loaded with library(tidyverse)). Perhaps the most common use of this function, and exactly how we’ll use it, is to pipe in a tibble or data frame, and drop one or more categorical variables using -. For example, let’s nest() the mtcars data set and drop the cylinder (cyl) column:

    mtcars %>% nest(-cyl)
    #> # A tibble: 3 × 2
    #>     cyl               data
    #>   <dbl>             <list>
    #> 1     6  <tibble [7 × 10]>
    #> 2     4 <tibble [11 × 10]>
    #> 3     8 <tibble [14 × 10]>

    This looks interesting. We have one column that makes sense: cyl lists each of the levels of the cylinder variable. But what’s that data column? Looks like tibbles. Let’s look into the tibble in the row where cyl == 4 to learn more:

    d <- mtcars %>% nest(-cyl)
    d$data[d$cyl == 4]
    #> [[1]]
    #> # A tibble: 11 × 10
    #>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1   22.8 108.0    93  3.85 2.320 18.61     1     1     4     1
    #> 2   24.4 146.7    62  3.69 3.190 20.00     1     0     4     2
    #> 3   22.8 140.8    95  3.92 3.150 22.90     1     0     4     2
    #> 4   32.4  78.7    66  4.08 2.200 19.47     1     1     4     1
    #> 5   30.4  75.7    52  4.93 1.615 18.52     1     1     4     2
    #> 6   33.9  71.1    65  4.22 1.835 19.90     1     1     4     1
    #> 7   21.5 120.1    97  3.70 2.465 20.01     1     0     3     1
    #> 8   27.3  79.0    66  4.08 1.935 18.90     1     1     4     1
    #> 9   26.0 120.3    91  4.43 2.140 16.70     0     1     5     2
    #> 10  30.4  95.1   113  3.77 1.513 16.90     1     1     5     2
    #> 11  21.4 121.0   109  4.11 2.780 18.60     1     1     4     2

    This looks a bit like the mtcars data, but did you notice that the cyl column isn’t there and that there’s only 11 rows? This is because we see a subset of the complete mtcars data set where cyl == 4. By using nest(-cyl), we’ve collapsed the entire mtcars data set into two columns and three rows (one for each category in cyl).

    Aside, it’s easy to dissect data by multiple categorical variables further by dropping them in nest(). For example, we can nest our data by the number of cylinders AND whether the car is automatic or manual (am) as follows:

    mtcars %>% nest(-cyl, -am)
    #> # A tibble: 6 × 3
    #>     cyl    am              data
    #>   <dbl> <dbl>            <list>
    #> 1     6     1  <tibble [3 × 9]>
    #> 2     4     1  <tibble [8 × 9]>
    #> 3     6     0  <tibble [4 × 9]>
    #> 4     8     0 <tibble [12 × 9]>
    #> 5     4     0  <tibble [3 × 9]>
    #> 6     8     1  <tibble [2 × 9]>

    If you compare carefully to the above, you’ll notice that each tibble in data has 9 columns instead of 10. This is because we’ve now extracted am. Also, there are far fewer rows in each tibble. This is because each tibble contains a much smaller subset of the data. E.g., instead of all the data for cars with 4 cylinders being in one cell, this data is further split into two cells – one for automatic, and one for manual cars.

     Fitting models to nested data

    Now that we can separate data for each group(s), we can fit a model to each tibble in data using map() from the purrr package (also tidyverse). We’re going to add the results to our existing tibble using mutate() from the dplyr package (again, tidyverse). Here’s a generic version of our pipe with adjustable parts in caps:

    DATA_SET %>%
      nest(-CATEGORICAL_VARIABLE) %>%
      mutate(fit = map(data, ~ MODEL_FUNCTION(...)))

    Where you see ..., a single dot (.) represents each nested tibble.

    Let’s start with a silly but simple example: a one-sample t-test examining whether the mean mpg differs significantly from 0 for each group of cars with a different number of cylinders:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)))
    #> # A tibble: 3 × 3
    #>     cyl               data         fit
    #>   <dbl>             <list>      <list>
    #> 1     6  <tibble [7 × 10]> <S3: htest>
    #> 2     4 <tibble [11 × 10]> <S3: htest>
    #> 3     8 <tibble [14 × 10]> <S3: htest>

    We’ll talk about the new fit column in a moment. First, let’s discuss the new line, mutate(fit = map(data, ~ t.test(.$mpg))):

    • mutate(fit = ...) is a dplyr function that will add a new column to our tibble called fit.
    • map(data, ...) is a purrr function that iterates through each cell of the data column (which has our nested tibbles).
    • ~ t.test(.$mpg) is running the t.test for each cell. Because this takes place within map(), we must start with ~, and use . whenever we want to reference the nested tibble that is being iterated on.

    What’s each <S3: htest> in the fit column? It’s the fitted t.test() model for each nested tibble. Just like we peeked into a single data cell, let’s look into a single fit cell – for cars with 4 cylinders:

    d <- mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)))
    d$fit[d$cyl == 4]
    #> [[1]]
    #>
    #>  One Sample t-test
    #>
    #> data:  .$mpg
    #> t = 19.609, df = 10, p-value = 2.603e-09
    #> alternative hypothesis: true mean is not equal to 0
    #> 95 percent confidence interval:
    #>  23.63389 29.69338
    #> sample estimates:
    #> mean of x
    #>  26.66364

    Looking good. So we now know how to nest() a data set by one or more groups, and fit a statistical model to the data corresponding to each group.

     Extracting fit information

    Our final goal is to obtain useful information from the fitted models. We could manually look into each fit cell, but this is tedious. Instead, we’ll extract information from our fitted models by adding one or more lines to mutate(), and using map_*(fit, ...) to iterate through each fitted model. For example, the following extracts the p.values from each t.test into a new column called p:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)),
             p = map_dbl(fit, "p.value"))
    #> # A tibble: 3 × 4
    #>     cyl               data         fit            p
    #>   <dbl>             <list>      <list>        <dbl>
    #> 1     6  <tibble [7 × 10]> <S3: htest> 3.096529e-08
    #> 2     4 <tibble [11 × 10]> <S3: htest> 2.602733e-09
    #> 3     8 <tibble [14 × 10]> <S3: htest> 1.092804e-11

    map_dbl() is used because we want to return a number (a “double”) rather than a list of objects (which is what map() does). Explaining the variants of map() and how to use them is well beyond the scope of this post. The important point here is that we can iterate through our fitted models in the fit column to extract information for each group of data. For more details, I recommend reading the “The Map Functions” section of R for Data Science.
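    As a small aside (not part of the original post), the same string shorthand and formula shorthand can pull out other length-one pieces of each fitted htest object; the column names here are purely illustrative:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit      = map(data, ~ t.test(.$mpg)),
             p        = map_dbl(fit, "p.value"),          # extract an element by name
             mean_mpg = map_dbl(fit, "estimate"),         # length-1 numeric, so map_dbl() works
             ci_low   = map_dbl(fit, ~ .$conf.int[1]))    # formula shorthand for anything more involved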

     broom and unnest()

    In addition to extracting a single value like above, we can extract entire data frames of information generated via functions from the broom package (which are available for most of the common models in R). For example, the glance() function returns a one-row data frame of model information. Let’s extract this information into a new column called results:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)),
             results = map(fit, glance))
    #> # A tibble: 3 × 4
    #>     cyl               data         fit              results
    #>   <dbl>             <list>      <list>               <list>
    #> 1     6  <tibble [7 × 10]> <S3: htest> <data.frame [1 × 8]>
    #> 2     4 <tibble [11 × 10]> <S3: htest> <data.frame [1 × 8]>
    #> 3     8 <tibble [14 × 10]> <S3: htest> <data.frame [1 × 8]>

    If you extract information like this, the next thing you’re likely to want to do is unnest() it as follows:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)),
             results = map(fit, glance)) %>%
      unnest(results)
    #> # A tibble: 3 × 11
    #>     cyl               data         fit estimate statistic      p.value
    #>   <dbl>             <list>      <list>    <dbl>     <dbl>        <dbl>
    #> 1     6  <tibble [7 × 10]> <S3: htest> 19.74286  35.93552 3.096529e-08
    #> 2     4 <tibble [11 × 10]> <S3: htest> 26.66364  19.60901 2.602733e-09
    #> 3     8 <tibble [14 × 10]> <S3: htest> 15.10000  22.06952 1.092804e-11
    #> # ... with 5 more variables: parameter <dbl>, conf.low <dbl>,
    #> #   conf.high <dbl>, method <fctr>, alternative <fctr>

    We’ve now unnested all of the model information, which includes the t value (statistic), the p value (p.value), and many others.

    We can do whatever we want with this information. For example, the below plots the group mpg means with confidence intervals generated by the t.test:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ t.test(.$mpg)),
             results = map(fit, glance)) %>%
      unnest(results) %>%
      ggplot(aes(x = factor(cyl), y = estimate)) +
        geom_bar(stat = "identity") +
        geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .2) +
        labs(x = "Cylinders (cyl)", y = "Miles Per Gallon (mpg)")

     Regression

    Let’s push ourselves and see if we can do the same sort of thing for linear regression. Say we want to examine whether the prediction of mpg by hp, wt and disp differs for cars with different numbers of cylinders. The first significant change will be our fit variable, created as follows:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)))
    #> # A tibble: 3 × 3
    #>     cyl               data      fit
    #>   <dbl>             <list>   <list>
    #> 1     6  <tibble [7 × 10]> <S3: lm>
    #> 2     4 <tibble [11 × 10]> <S3: lm>
    #> 3     8 <tibble [14 × 10]> <S3: lm>

    That’s it! Notice how everything else is the same. All we’ve done is swapped out a t.test() for lm(), using our variables and data in the appropriate places. Let’s glance() at the model:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, glance)) %>%
      unnest(results)
    #> # A tibble: 3 × 14
    #>     cyl               data      fit r.squared adj.r.squared    sigma
    #>   <dbl>             <list>   <list>     <dbl>         <dbl>    <dbl>
    #> 1     6  <tibble [7 × 10]> <S3: lm> 0.7217114     0.4434228 1.084421
    #> 2     4 <tibble [11 × 10]> <S3: lm> 0.7080702     0.5829574 2.912394
    #> 3     8 <tibble [14 × 10]> <S3: lm> 0.4970692     0.3461900 2.070017
    #> # ... with 8 more variables: statistic <dbl>, p.value <dbl>, df <int>,
    #> #   logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>

    We haven’t added anything we haven’t seen already. Let’s go and plot the R-squared values to see just how much variance is accounted for in each model:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, glance)) %>%
      unnest(results) %>%
      ggplot(aes(x = factor(cyl), y = r.squared)) +
        geom_bar(stat = "identity") +
        labs(x = "Cylinders", y = expression(R^{2}))

    It looks to me like the model performs worse for cars with 8 cylinders than for cars with 4 or 6 cylinders.

     Row-wise values and augment()

    We’ll cover one final addition: extracting row-wise data with broom’s augment() function. Unlike glance(), augment() extracts information that matches every row of the original data such as the predicted and residual values. If we have a model that augment() works with, we can add it to our mutate call just as we added glance(). Let’s swap out glance() for augment() in the regression model above:

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, augment))
    #> # A tibble: 3 × 4
    #>     cyl               data      fit                results
    #>   <dbl>             <list>   <list>                 <list>
    #> 1     6  <tibble [7 × 10]> <S3: lm>  <data.frame [7 × 11]>
    #> 2     4 <tibble [11 × 10]> <S3: lm> <data.frame [11 × 11]>
    #> 3     8 <tibble [14 × 10]> <S3: lm> <data.frame [14 × 11]>

    Our results column again contains data frames, but each has as many rows as the original nested tibbles in the data columns. What happens when we unnest() it?

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, augment)) %>%
      unnest(results)
    #> # A tibble: 32 × 12
    #>      cyl   mpg    hp    wt  disp  .fitted   .se.fit     .resid      .hat
    #>    <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl>     <dbl>
    #> 1      6  21.0   110 2.620 160.0 21.43923 0.8734029 -0.4392256 0.6486848
    #> 2      6  21.0   110 2.875 160.0 20.44570 0.6760327  0.5543010 0.3886332
    #> 3      6  21.4   110 3.215 258.0 20.69886 0.9595681  0.7011436 0.7829898
    #> 4      6  18.1   105 3.460 225.0 19.26783 0.6572258 -1.1678250 0.3673108
    #> 5      6  19.2   123 3.440 167.6 18.22410 0.7031674  0.9758992 0.4204573
    #> 6      6  17.8   123 3.440 167.6 18.22410 0.7031674 -0.4241008 0.4204573
    #> 7      6  19.7   175 2.770 145.0 19.90019 1.0688377 -0.2001924 0.9714668
    #> 8      4  22.8    93 2.320 108.0 25.71625 1.0106110 -2.9162542 0.1204114
    #> 9      4  24.4    62 3.190 146.7 22.89906 2.4068779  1.5009358 0.6829797
    #> 10     4  22.8    95 3.150 140.8 21.26402 1.6910426  1.5359798 0.3371389
    #> # ... with 22 more rows, and 3 more variables: .sigma <dbl>,
    #> #   .cooksd <dbl>, .std.resid <dbl>

    Wow, there’s a lot going on here! We’ve unnested the entire data set related to the fitted regression models, complete with information like predicted (.fitted) and residual (.resid) values. Below is a plot of these predicted values against the actual values. For more details on this, see my previous post on plotting residuals.

    mtcars %>%
      nest(-cyl) %>%
      mutate(fit = map(data, ~ lm(mpg ~ hp + wt + disp, data = .)),
             results = map(fit, augment)) %>%
      unnest(results) %>%
      ggplot(aes(x = mpg, y = .fitted)) +
        geom_abline(intercept = 0, slope = 1, alpha = .2) +  # Line of perfect fit
        geom_point() +
        facet_grid(cyl ~ .) +
        theme_bw()

    This figure is showing us the fitted results of three separate regression analyses: one for each subset of the mtcars data corresponding to cars with 4, 6, or 8 cylinders. As we know from above, the R2 value for cars with 8 cylinders is lowest, and it’s somewhat evident from this plot (though the small sample sizes make it difficult to feel confident).

     randomForest example

    For anyone looking to sink their teeth into something a little more complex, below is a fully worked example of examining the relative importance of variables in a randomForest() model. The model predicts the arrival delay of flights using time-related variables (departure time, year, month and day). Relevant to this post, we fit this model to the data separately for each of three airline carriers.

    Notice that this implements the same code we’ve been using so far, with just a few tweaks to select an appropriate data set and obtain information from the fitted models.

    The resulting plot suggests to us that the importance of a flight’s day for predicting its arrival delay varies depending on the carrier. Specifically, it is reasonably informative for predicting the arrival delay of Pinnacle Airlines (9E), not so useful for Virgin America (VX), and practically useless for Alaska Airlines (AS).

    library(randomForest)
    library(nycflights13)

    # Convenience function to get importance information from a randomForest fit
    # into a data frame
    imp_df <- function(rf_fit) {
      imp  <- randomForest::importance(rf_fit)
      vars <- rownames(imp)
      imp %>%
        tibble::as_tibble() %>%
        dplyr::mutate(var = vars)
    }

    set.seed(123)

    flights %>%
      # Selecting data to work with
      na.omit() %>%
      select(carrier, arr_delay, year, month, day, dep_time) %>%
      filter(carrier %in% c("9E", "AS", "VX")) %>%
      # Nesting data and fitting model
      nest(-carrier) %>%
      mutate(fit = map(data, ~ randomForest(arr_delay ~ ., data = .,
                                            importance = TRUE,
                                            ntree = 100)),
             importance = map(fit, imp_df)) %>%
      # Unnesting and plotting
      unnest(importance) %>%
      ggplot(aes(x = `%IncMSE`, y = var, color = `%IncMSE`)) +
        geom_segment(aes(xend = min(`%IncMSE`), yend = var), alpha = .2) +
        geom_point(size = 3) +
        facet_grid(. ~ carrier) +
        guides(color = "none") +
        theme_bw()

     Sign off

    Thanks for reading and I hope this was useful for you.

    For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

    If you’d like the code that produced this blog, check out the blogR GitHub repository.

    To leave a comment for the author, please follow the link and comment on their blog: blogR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Introduction to R: free basic R course on DataCamp in italian

    Mon, 2016-09-19 05:10

    (This article was first published on MilanoR, and kindly contributed to R-bloggers)

    This article was originally posted on the Quantide blog – see here (Italian version).

    Good news for all Italian speakers who are leaning towards the R world but are still a bit afraid of what they might find (or not find) in their first face-to-face experience with R:

    Quantide, in collaboration with DataCamp, is offering a free online introductory R course entirely in Italian.

    Who is this course for

    The course is free and open to anyone who wishes to attend it.

    Since the main objective is to introduce the basics of R, the course is a particularly good fit for:

    • Statistics students who need a quick start to R.
    • Professionals/employees who would like to start using R and need a quick roadmap of the basics.
    • Anyone who would like to get into the data science field and get familiar with its tools.
    When and where will the course be available

    The course is already available on DataCamp’s platform. You can access it by registering on DataCamp. Click here to be redirected to the course home page.

    What to expect

    The course is made up of six chapters. Each chapter introduces a new R concept, gradually explains R’s core structures, and tests your understanding with in-browser coding challenges.

    1. Introduction to the basics
    2. Vectors
    3. Matrices
    4. Factors
    5. Data frames
    6. Lists

    The exercises can be completed in an interactive in-browser R session, and feedback is provided for each assignment.

    What will I be able to do after having attended the course

    After completing the course, you will be able to do basic data manipulation in R, use the main data structures, and carry out your first simple data analysis. In particular, you will have gained the basic knowledge needed to dive into more advanced R courses. The snippet below gives a flavour of the structures the chapters cover.
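    As a rough illustration of those basics (this snippet is mine and not taken from the course itself):

    # Vectors, matrices, factors, data frames and lists -- the structures the chapters cover
    v  <- c(21, 23, 19)                           # a numeric vector
    m  <- matrix(1:6, nrow = 2)                   # a 2 x 3 matrix
    f  <- factor(c("low", "high", "low"))         # a factor with two levels
    df <- data.frame(score = v, group = f)        # a data frame combining them
    l  <- list(values = v, table = m, info = df)  # a list can hold anything

    mean(df$score)                                # a first, very simple analysis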

    Time needed and schedule

    You can take the course whenever you like: just sign up to DataCamp and get started at whatever time of day suits you. There’s no rush: you can complete the exercises at your own pace. DataCamp saves your progress and your answers so that you can look back at how you solved a particular exercise.

    The estimated time needed to complete the whole course is about four hours. Once you have started a chapter, you can do just a few exercises, take a break, and then pick up again from where you left off. This flexible structure is perfectly suited to those who only have a little time here and there but would still like to learn R.

    It is possible to skip exercises and chapters too, but it is advisable to take the course sequentially and complete each step before going forward.

    Course requirements

    Good news: there are no requirements; you just need a PC (or a Mac).

    Categories: Methodology Blogs

    Analyzing Stack Overflow questions and tags with the StackLite dataset

    Sun, 2016-09-18 20:06

    (This article was first published on analytics for fun, and kindly contributed to R-bloggers)

    The guys at Stack Overflow have recently released a very interesting dataset containing the entire history of questions made by users since the beginning of the site, back in 2008. It’s called…

    <<Keep reading>>

    To leave a comment for the author, please follow the link and comment on their blog: analytics for fun. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    vecpack: an R package for packing stuff into vectors

    Sun, 2016-09-18 12:20

    (This article was first published on R – dahtah, and kindly contributed to R-bloggers)

    Here’s a problem I’ve had again and again: let’s say you’ve defined a statistical model with several parameters. One of them is a scalar. Another is a matrix. The third one is a vector, and so on. When fitting the model the natural thing to do is to write a likelihood function that takes as many arguments as you have parameters in your model: i.e., lik(x,y,z) where x is a scalar, y a matrix and z a vector. The problem is that, while it’s the natural way of writing that function, that’s not what optimisers like “optim” want: they want a function with a single argument, and that argument should be a vector. So you have to pack everything into a vector, and write a whole lot of boilerplate code to unpack all the parameters out of that vector.
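    To make the problem concrete, here is a minimal sketch (mine, not from the post or the package) of the manual packing and unpacking you would otherwise write around base R’s optim(); the parameter names and dimensions are just an example:

    # Toy objective with a scalar a, a length-2 vector b, and a 2x2 matrix M
    lik <- function(a, b, M) (a - 1)^2 + sum((b - 2)^2) + sum((M - diag(2))^2)

    # Boilerplate: squash everything into one vector for optim(), then unpack it again
    obj <- function(theta) {
      a <- theta[1]
      b <- theta[2:3]
      M <- matrix(theta[4:7], nrow = 2, ncol = 2)
      lik(a, b, M)
    }

    res <- optim(rep(0, 7), obj, method = "BFGS")
    res$par  # comes back as one flat vector, which you must unpack yet again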

    vecpack saves you from having to write all that boilerplate:

    devtools::install_github("dahtah/vecpack")
    library(vecpack)

    # A cost function in two arguments:
    cost <- function(a, b) (3*a - b + 2)^2

    # Call optim via vpoptim
    res <- vpoptim(list(a = 1, b = 0), cost)
    res$par

    vecpack knows how to automatically pack and unpack scalars, vectors, matrices and images (from the imager package). It’s also very easy to extend.
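    Going by the vpoptim() call above, the shapes appear to be inferred from the initial-guess list, so something like the following sketch should be possible for a vector plus a matrix. This is my guess at the usage, not an example from the package documentation:

    # Hypothetical: a cost over a length-3 vector beta and a 2x2 matrix M
    cost2 <- function(beta, M) sum((beta - 1)^2) + sum((M - diag(2))^2)

    # The initial guesses carry the shapes, so no manual packing should be needed
    res2 <- vpoptim(list(beta = rep(0, 3), M = matrix(0, 2, 2)), cost2)
    res2$par  # presumably returned with beta and M in their original shapes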

    The package is quite new, and not on CRAN yet. Feedback welcome, either here or on the issues page on github.

     

    To leave a comment for the author, please follow the link and comment on their blog: R – dahtah. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs