R bloggers

R news and tutorials contributed by (600) R bloggers
Updated: 5 hours 14 min ago

Outlier App: An Interactive Visualization of Outlier Algorithms

Fri, 2016-12-30 07:42

(This article was first published on DataScience+, and kindly contributed to R-bloggers)

I was recently trying various outlier detection algorithms. For me, the best way to understand an algorithm is to tinker with it. I built a shiny app that allows you to play around with various outlier algorithms and wanted to share it with everyone.

The shiny app is available on my site, but even better, the code is on github for you to run locally or improve! Let me give you a quick tour of the app in this post. If you prefer, I have also posted a video that provides background on the app. Another tutorial on how to build interactive web apps with Shiny has been published at DataScience+.

Background

The available algorithms include:

– Hierarchical Clustering (DMwR)
– Kmeans Euclidean Distance
– Kmeans Mahalanobis
– Kmeans Manhattan
– Fuzzy kmeans – Gustafson and Kessel
– Fuzzy k-medoids
– Fuzzy k-means with polynomial fuzzifier
– Local Outlier Factor (dbscan)
– RandomForest (proximity from randomForest)
– Isolation Forest (IsolationForest)
– Autoencoder (Autoencoder)
– FBOD and SOD (HighDimOut)

There is also a wide range of datasets to try, including randomly scattered points, defined clusters, and some more unusual patterns like the smiley face and spirals. Additionally, you can use your mouse to add and/or remove points by clicking directly within the visualization, which lets you create your own dataset.
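As a side note, this click-to-edit mechanic is plain Shiny machinery. Below is a minimal, self-contained sketch (not the app's actual code) of how clicking on a plot can append points to a reactive dataset; the starting data is random and purely illustrative.

library(shiny)

ui <- fluidPage(plotOutput("plot", click = "plot_click"))

server <- function(input, output, session) {
  # reactive data frame of points, seeded with random values
  pts <- reactiveVal(data.frame(x = runif(20), y = runif(20)))
  # each click appends the clicked coordinates as a new point
  observeEvent(input$plot_click, {
    pts(rbind(pts(), data.frame(x = input$plot_click$x,
                                y = input$plot_click$y)))
  })
  output$plot <- renderPlot(
    plot(pts()$x, pts()$y, pch = 19, xlim = c(0, 1), ylim = c(0, 1))
  )
}

shinyApp(ui, server)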

Using the app

Once the data is loaded, you can start exploring. One thing you can do is look at the effect scaling can have. In this example, you can see how outliers differ when scaling is used with kmeans. The values on the far right no longer dominate the distance measurements, and there are now outliers from other areas:
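To see the scaling effect outside the app, here is a minimal sketch (not the app's code) that flags the points farthest from their kmeans centroid, with and without scale(); the toy data and the "top 5" cutoff are arbitrary choices for illustration.

set.seed(1)
pts <- cbind(x = rnorm(100, sd = 100),  # large-scale variable
             y = rnorm(100, sd = 1))    # small-scale variable

km_dist <- function(m, centers = 3) {
  km <- kmeans(m, centers = centers, nstart = 10)
  # Euclidean distance from each point to its assigned centroid
  sqrt(rowSums((m - km$centers[km$cluster, , drop = FALSE])^2))
}

# top 5 points by distance, without and then with scaling
print(order(km_dist(pts), decreasing = TRUE)[1:5])
print(order(km_dist(scale(pts)), decreasing = TRUE)[1:5])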

It quickly becomes apparent that different algorithms may select different outliers. In this case, you see a difference between the outliers selected using an autoencoder versus isolation forest.

Another example here is the difference between chosen outliers using kmeans versus fuzzy kmeans:

A density-based algorithm can also select different outliers than a distance-based algorithm. This example nicely shows the difference between kmeans and lof (the local outlier factor from dbscan).
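A quick way to see this outside the app is to compare the two rankings directly. The sketch below reuses the toy pts matrix and km_dist() helper from above; the choice of minPts = 5 is arbitrary, and older versions of dbscan call this argument k.

library(dbscan)

lof_scores <- lof(scale(pts), minPts = 5)  # density-based scores
km_scores  <- km_dist(scale(pts))          # distance-based scores

# the two methods can flag different points
print(order(lof_scores, decreasing = TRUE)[1:5])
print(order(km_scores, decreasing = TRUE)[1:5])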

An important part of using this visualization is studying the distance numbers that are calculated. Are these numbers meshing with your intuition? How big of a quantitative difference is there between outliers and other points?

3D+ App?

The next question is whether to expand this to larger datasets. This is something that you would run locally (large datasets take too long to run on my shiny server). The downside of larger datasets is that it gets trickier to visualize them. For now, I am using a t-SNE plot. I am open to suggestions, but the intent here is a way to evaluate outlier algorithms on a variety of datasets.

Source Code

The source code for the outlier app is on github. The app is built off a variety of R packages and could easily be extended to new packages or incorporate additional datasets. Please send me bug fixes, additional algorithms, tighter code, or ideas for improving the app.

    Related Post

    1. Creating an animation using R
    2. The importance of Data Visualization
    3. ggplot2 themes examples
    4. Map the Life Expectancy in United States with data from Wikipedia
    5. What can we learn from the statistics of the EURO 2016 – Application of factor analysis

    To leave a comment for the author, please follow the link and comment on their blog: DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    dotplot for GSEA result

    Thu, 2016-12-29 23:58

    (This article was first published on R on Guangchuang YU, and kindly contributed to R-bloggers)

    For GSEA analysis, we are familiar with the above figure, which shows the running enrichment score. But most software lacks a visualization method to summarize the whole enrichment result.

    In DOSE (and related tools including clusterProfiler, ReactomePA and meshes), we provide enrichMap and cnetplot to summarize GSEA result.

    Here is an example of using enrichMap, which is good for visualizing relationships among enriched gene sets.

    cnetplot excels at visualizing the relationships among gene sets and their corresponding core genes.

    Now DOSE supports visualizing GSEA results using dotplot, which can display more enriched gene sets in one figure. This was a feature request from @guidohooiveld.

    dotplot was previously implemented in DOSE to visualize hypergeometric test results. I modified it to support GSEA results.

    Internally, .sign is reserved for the sign of the NES (activated for NES > 0 and suppressed for NES < 0). We can pass the split parameter, and showCategory will then be applied to each split of the results. The following example plots 30 activated and 30 suppressed enriched disease sets.
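    For reference, a minimal sketch of such a call, assuming a GSEA result object like one returned by gseDO(); the object name gsea_res is hypothetical, and the facet_grid() layer is just one way to place the two groups side by side:

    library(DOSE)
    library(ggplot2)

    # gsea_res is a hypothetical gseaResult object, e.g. gsea_res <- gseDO(geneList)
    dotplot(gsea_res, showCategory = 30, split = ".sign") +
      facet_grid(. ~ .sign)   # one panel for activated, one for suppressed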

    PS: Count is the number of core genes and GeneRatio is Count/setSize.

    Citation

    G Yu, LG Wang, GR Yan, QY He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609.

    To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU.

    Categories: Methodology Blogs

    Pokemon and TrelliscopeJS!

    Thu, 2016-12-29 19:00

    (This article was first published on Ryan Hafen, and kindly contributed to R-bloggers)

    I’m always looking for ways to spark my kid’s interest in computers, data, etc. This has proven to be more difficult than I thought it would be (kids these days…). I suspect this may have something to do with the ubiquity of electronic devices that “just work”, making them less novel and less interesting to tinker with, but speculation on this is a post for another time…

    Anyway, all of my kids are crazy into Pokemon, so when I came across some Pokemon data the other day that lent itself very nicely to a Trelliscope display, I thought I might have a chance to engage them. And then I thought, why not write a post about it? You can find the resulting display and code to recreate it in this post. Hope you enjoy!

    To start out, here’s the display:

    If this display doesn’t appear correctly for you (because of blog aggregators, etc.), you can follow this link to the display in a dedicated window. For better viewing, you can also click the bottom right “fullscreen” button to expand the display to fill the window.

    The data from which this was created is a simple data frame of Pokemon statistics, based on this source (which borrows from here). I slightly modified the data to add some variables that enhance the display (I changed the image URL to a better source, added a variable “pokedex” that provides a link to the pokemon’s pokedex entry on pokemon.com, and removed a few special Pokemon that I couldn’t find on pokedex).

    Since this data is simply a data frame where each row refers to a Pokemon, it lends itself nicely to a Trelliscope display showing an image of the Pokemon as the panel and allowing interaction with the Pokemon being viewed based on the various statistics provided.

    Here’s the code to make the display. Once the data is read, it’s just a few lines.

    # install packages if not installed
    devtools::install_github("hafen/trelliscopejs")
    install.packages(c("readr", "dplyr"))

    library(readr)
    library(dplyr)
    library(trelliscopejs)

    # read the data (making "_id" columns strings)
    pok <- read_csv("https://raw.githubusercontent.com/hafen/pokRdex/master/pokRdex_mod.csv") %>%
      mutate_at(vars(matches("_id$")), as.character)

    # take a look
    glimpse(pok)

    Observations: 801
    Variables: 30
    $ pokemon                 <chr> "bulbasaur", "ivysaur", "venusaur", "ve...
    $ id                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...
    $ species_id              <chr> "1", "2", "3", "3", "4", "5", "6", "6",...
    $ height                  <int> 7, 10, 20, 24, 6, 11, 17, 17, 17, 5, 10...
    $ weight                  <int> 69, 130, 1000, 1555, 85, 190, 905, 1105...
    $ base_experience         <int> 64, 142, 236, 281, 62, 142, 240, 285, 2...
    $ type_1                  <chr> "grass", "grass", "grass", "grass", "fi...
    $ type_2                  <chr> "poison", "poison", "poison", "poison",...
    $ attack                  <int> 49, 62, 82, 100, 52, 64, 84, 130, 104, ...
    $ defense                 <int> 49, 63, 83, 123, 43, 58, 78, 111, 78, 6...
    $ hp                      <int> 45, 60, 80, 80, 39, 58, 78, 78, 78, 44,...
    $ special_attack          <int> 65, 80, 100, 122, 60, 80, 109, 130, 159...
    $ special_defense         <int> 65, 80, 100, 120, 50, 65, 85, 85, 115, ...
    $ speed                   <int> 45, 60, 80, 80, 65, 80, 100, 100, 100, ...
    $ ability_1               <chr> "overgrow", "overgrow", "overgrow", "th...
    $ ability_2               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
    $ ability_hidden          <chr> "chlorophyll", "chlorophyll", "chloroph...
    $ color_1                 <chr> "#78C850", "#78C850", "#78C850", "#78C8...
    $ color_2                 <chr> "#A040A0", "#A040A0", "#A040A0", "#A040...
    $ color_f                 <chr> "#81A763", "#81A763", "#81A763", "#81A7...
    $ egg_group_1             <chr> "monster", "monster", "monster", "monst...
    $ egg_group_2             <chr> "plant", "plant", "plant", "plant", "dr...
    $ url_image               <chr> "http://assets.pokemon.com/assets/cms2/...
    $ generation_id           <chr> "1", "1", "1", NA, "1", "1", "1", NA, N...
    $ evolves_from_species_id <chr> NA, "1", "2", NA, NA, "4", "5", NA, NA,...
    $ evolution_chain_id      <chr> "1", "1", "1", NA, "2", "2", "2", NA, N...
    $ shape_id                <chr> "8", "8", "8", NA, "6", "6", "6", NA, N...
    $ shape                   <chr> "quadruped", "quadruped", "quadruped", ...
    $ pokebase                <chr> "bulbasaur", "ivysaur", "venusaur", "ve...
    $ pokedex                 <chr> "http://www.pokemon.com/us/pokedex/bulb...

    Now we can create a Trelliscope display by specifying url_image as the source for the panel images. We also specify a default state indicating that the values for the variables pokemon and pokedex should be shown as labels by default.

    pok %>%
      mutate(panel = img_panel(url_image)) %>%
      trelliscope("pokemon", nrow = 3, ncol = 6,
        state = list(labels = c("pokemon", "pokedex")))

    This will produce the interactive plot shown at the top of this post. You can use the display to find Pokemon based on sorting or filtering on several of their attributes.

    Note that despite my kids constantly telling me about and showing me their Pokemon cards, I am not a Pokemon expert, so there may be some interesting things I am missing. But I can say that my kids were finally impressed and engaged with something that I showed them. Success!

    If this is your first exposure to Trelliscope and you are interested in other things it can do, please see my original blog post.

    This is my last post of the year. Happy new year!


    To leave a comment for the author, please follow the link and comment on their blog: Ryan Hafen.

    Categories: Methodology Blogs

    a Galton-Watson riddle

    Thu, 2016-12-29 18:16

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    The Riddler of this week has an extinction riddle which can be summarised as follows:

    One observes a population of N individuals, each with a probability of 10⁻⁴ to kill the observer each day. From one day to the next, the population decreases by one individual with probability

    K√N · 10⁻⁴

    What is the value of K that leaves the observer alive with probability ½?

    Given the sequence of population sizes N,N¹,N²,…, the probability to remain alive is

    (1−10⁻⁴)^(N+N¹+N²+⋯),

    where the sum stops with the (sure) extinction of the population. This is the moment generating function of the sum, evaluated at x=1−10⁻⁴. Hence the problem relates to a Galton-Watson extinction problem. However, given the nature of the extinction process, I do not see a way to determine the distribution of the sum except by simulation. Which returns K=26.3 for the specific value of N=9.

    N=9
    K=3*N
    M=10^4
    vals=rep(0,M)
    targ=0
    ite=1
    while (abs(targ-.5)>.01){
      for (t in 1:M){
        gen=vals[t]=N
        while (gen>0){
          gen=gen-(runif(1)<K*sqrt(gen)/10^4)
          vals[t]=vals[t]+gen}
      }
      targ=mean(exp(vals*log(.9999)))
      print(c(as.integer(ite),K,targ))
      if (targ<.5){
        K=K*ite/(1+ite)}else{
        K=K/(ite/(1+ite))}
      ite=ite+1}

    Filed under: R, Travel Tagged: Francis Galton, Galton-Watson extinction, R, The Riddler

    To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

    Categories: Methodology Blogs

    7 Visualizations You Should Learn in R

    Thu, 2016-12-29 15:51

    (This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

    7 Visualizations You Should Learn in R

    With the ever-increasing volume of data, it is impossible to tell stories without visualizations. Data visualization is the art of turning numbers into useful knowledge.

    R programming lets you learn this art by offering a set of inbuilt functions and libraries for building visualizations and presenting data. Before the technical implementation of the visualizations, let's first see how to select the right chart type.

    Selecting the Right Chart Type

    There are four basic presentation types:
    1. Comparison
    2. Composition
    3. Distribution
    4. Relationship

    To determine which amongst these is best suited for your data, I suggest you answer a few questions such as:

    • How many variables do you want to show in a single chart?
    • How many data points will you display for each variable?
    • Will you display values over a period of time, or among items or groups?

    Below is a great explanation of how to select the right chart type, by Dr. Andrew Abela.

    In your day-to-day activities, you'll come across the 7 charts listed below most of the time.

    1. Scatter Plot
    2. Histogram
    3. Bar & Stack Bar Chart
    4. Box Plot
    5. Area Chart
    6. Heat Map
    7. Correlogram

    We'll use the 'Big Mart data' example shown below to understand how to create visualizations in R. You can download the full dataset from here.

    Now let’s see how to use these visualizations in R

    1. Scatter Plot

    When to use: Scatter Plot is used to see the relationship between two continuous variables.

    In our mart dataset above, if we want to visualize the items as per their cost data, we can use a scatter plot with two continuous variables, namely Item_Visibility and Item_MRP, as shown below.

    Here is the R code for a simple scatter plot using the function ggplot() with geom_point().

    library(ggplot2)  # ggplot2 is an R library for visualizations; 'train' is the Big Mart data frame

    ggplot(train, aes(Item_Visibility, Item_MRP)) +
      geom_point() +
      scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
      scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
      theme_bw()

    Now we can view a third variable in the same chart, say a categorical variable (Item_Type), which gives the characteristic of each data point. In the chart below, the different categories are depicted with a different color for each Item_Type.

    R code with an addition of category:

    ggplot(train, aes(Item_Visibility, Item_MRP)) +
      geom_point(aes(color = Item_Type)) +
      scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
      scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
      theme_bw() +
      labs(title = "Scatterplot")

    We can make it even more visually clear by creating a separate scatter plot for each Item_Type, as shown below.

    R code for separate category wise chart:

    ggplot(train, aes(Item_Visibility, Item_MRP)) +
      geom_point(aes(color = Item_Type)) +
      scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
      scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
      theme_bw() +
      labs(title = "Scatterplot") +
      facet_wrap( ~ Item_Type)

    Here, facet_wrap works superbly and wraps the Item_Type panels into a rectangular layout.

    2. Histogram

    When to use: A histogram is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution of these bins. We can always change the bin size and see the effect it has on the visualization.

    From our mart dataset, if we want to know the count of items on the basis of their cost, we can plot a histogram of the continuous variable Item_MRP as shown below.


    Here is the R code for a simple histogram plot using the function ggplot() with geom_histogram().

    ggplot(train, aes(Item_MRP)) +
      geom_histogram(binwidth = 2) +
      scale_x_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
      scale_y_continuous("Count", breaks = seq(0, 200, by = 20)) +
      labs(title = "Histogram")

    3. Bar & Stack Bar Chart

    When to use: Bar charts are recommended when you want to plot a categorical variable or a combination of continuous and categorical variables.

    From our dataset, if we want to know the number of marts established in a particular year, a bar chart would be the most suitable option, using the variable Outlet_Establishment_Year as shown below.

    Here is the R code for a simple bar plot using the function ggplot() for a single variable.

    ggplot(train, aes(Outlet_Establishment_Year)) +
      geom_bar(fill = "red") + theme_bw() +
      scale_x_continuous("Establishment Year", breaks = seq(1985, 2010)) +
      scale_y_continuous("Count", breaks = seq(0, 1500, 150)) +
      coord_flip() +
      labs(title = "Bar Chart") +
      theme_gray()

    Vertical Bar Chart:

    As a variation, you can remove the coord_flip() layer to draw the above bar chart vertically.

    To show item weights (a continuous variable) on the basis of outlet type (a categorical variable) in a single bar chart, use the following code:

    ggplot(train, aes(Item_Type, Item_Weight)) +
      geom_bar(stat = "identity", fill = "darkblue") +
      scale_x_discrete("Outlet Type") +
      scale_y_continuous("Item Weight", breaks = seq(0, 15000, by = 500)) +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
      labs(title = "Bar Chart")

    Stacked Bar chart:

    A stacked bar chart is an advanced version of the bar chart, used for visualizing a combination of categorical variables.

    From our dataset, if we want to know the count of outlets on the basis of two categorical variables, outlet type (Outlet_Type) and location (Outlet_Location_Type), a stacked bar chart will visualize the scenario in the most useful manner.

    Here is the R code for a simple stacked bar chart using the function ggplot().

    ggplot(train, aes(Outlet_Location_Type, fill = Outlet_Type)) +
      geom_bar() +
      labs(title = "Stacked Bar Chart", x = "Outlet Location Type",
           y = "Count of Outlets")

    4. Box Plot

    When to use: Box plots are used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the data and detecting outliers. It shows five summary numbers: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum.

    From our dataset, if we want to see each outlet's item sales in detail, including the minimum, maximum, and median, a box plot can be helpful. In addition, it also gives the values of the outliers in item sales for each outlet, as shown in the chart below.

    The black points are outliers. Outlier detection and removal is an essential step of successful data exploration.

    Here is the R code for a simple box plot using the function ggplot() with geom_boxplot().

    ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) +
      geom_boxplot(fill = "red") +
      scale_y_continuous("Item Outlet Sales", breaks = seq(0, 15000, by = 500)) +
      labs(title = "Box Plot", x = "Outlet Identifier")

    5. Area Chart

    When to use: An area chart is used to show continuity across a variable or data set. It is very much the same as a line chart and is commonly used for time series plots. Alternatively, it is also used to plot continuous variables and analyze the underlying trends.

    From our dataset, when we want to analyze the trend of item outlet sales, an area chart can be plotted as shown below. It shows the count of outlets on the basis of sales.

    Here is the R code for a simple area chart showing the continuity of Item Outlet Sales using the function ggplot() with geom_area().

    ggplot(train, aes(Item_Outlet_Sales)) +
      geom_area(stat = "bin", bins = 30, fill = "steelblue") +
      scale_x_continuous(breaks = seq(0, 11000, 1000)) +
      labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")

    6. Heat Map

    When to use: A heat map uses the intensity (density) of colors to display the relationship between two, three, or many variables in a two-dimensional image. It allows you to explore two dimensions on the axes and a third dimension through the intensity of color.

    From our dataset, if we want to know the cost of each item at every outlet, we can plot a heat map as shown below using the three variables Item_MRP, Outlet_Identifier, and Item_Type from our mart dataset.


    The dark portions indicate an Item MRP close to 50. The brighter portions indicate an Item MRP close to 250.

    Here is the R code for a simple heat map using the function ggplot().

    ggplot(train, aes(Outlet_Identifier, Item_Type)) +
      geom_raster(aes(fill = Item_MRP)) +
      labs(title = "Heat Map", x = "Outlet Identifier", y = "Item Type") +
      scale_fill_continuous(name = "Item MRP")

    7. Correlogram

    When to use: A correlogram is used to examine the level of correlation among the variables available in the data set. The cells of the matrix can be shaded or colored to show the correlation value.

    The darker the color, the higher the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.

    From our dataset, let's check the correlation between item cost, weight, and visibility, along with outlet establishment year and outlet sales, in the plot below.

    In our example, we can see that Item cost & Outlet sales are positively correlated while Item weight & its visibility are negatively correlated.

    Here is the R code for a simple correlogram using the function corrgram().

    install.packages("corrgram") library(corrgram) corrgram(train, order=NULL, panel=panel.shade, text.panel=panel.txt, main="Correlogram")

    Now it should be easy for you to visualize data using the ggplot2 library in R.

    Apart from visualizations, you can learn about data mining in R through our webinar recording on Google Analytics Data Mining with R (includes 3 Real Applications)

    To know more or for any assistance on R programming, please drop us a comment with your details & we will be glad to assist you!!

    The post 7 Visualizations You Should Learn in R appeared first on Tatvic Blog.

    To leave a comment for the author, please follow the link and comment on their blog: Tatvic Blog » R.

    Categories: Methodology Blogs

    Using R to prevent food poisoning in Chicago

    Thu, 2016-12-29 14:56

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    There are more than 15,000 restaurants in Chicago, but fewer than 40 inspectors tasked with making sure they comply with food-safety standards. To help prioritize the facilities targeted for inspection, the City of Chicago used R to create a model that predicts which restaurants are most likely to fail an inspection. Using this model to deploy inspectors, the City is able to detect unsafe restaurants more than a week sooner than by using traditional selection methods, and cite 37 additional restaurants per month.

    Chicago's Department of Public Health used the R language to build and deploy the model, and made the code available as an open source project on GitHub. The reasons given are twofold:

    • An open source approach helps build a foundation for other models attempting to forecast violations at food establishments.
    • The analytic code is written in R, an open source, widely-known programming language for statisticians. There is no need for expensive software licenses to view and run this code.

    Releasing the model as open source has had benefits far beyond Chicago as well: Montgomery County, MD adopted the process and also saw improvements in its food safety inspection process.

    You can see how the model is used in practice in the video below from PBS NewsHour. Fast forward to the 3:00 mark to see Tom Schenk, Chief Data Officer for the City of Chicago, describe how the data science team there used R to develop the model. (There's also a close-up of R code using the data.table package around the 6:45 mark.)

    The video also describes the Foodborne Chicago Twitter detection system for flagging tweets describing food poisoning in Chicago (also implemented with R).

    PBS NewsHour: Up to code? An algorithm is helping Chicago health officials predict restaurant safety violations (via reader MD)

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    Intermediate Tree 1

    Thu, 2016-12-29 12:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    If you followed the Basic Decision Tree exercise, this should be useful for you. This is like a continuation, but we add so much more. We are working with a bigger and badder dataset. We will also be using techniques we learned from model evaluation and work with ROC, accuracy, and other metrics.

    Answers to the exercises are available here.

    If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1
    Read in the adult.csv file with header=FALSE. Store this in df. Use the str() command to see the dataframe. Download the data from here.

    Exercise 2
    You are given the meta_data that goes with the CSV; you can download it here. Use it to add the column names to your dataframe. Notice that df is ordered going from V1, V2, V3 and so on. As a side note, it is always best practice to use the meta_data to match and check that all the columns were read in correctly.

    Exercise 3
    Use the table command and print out the distribution of the class feature.

    Exercise 4
    Change the class column to binary.

    Learn more about decision trees in the online courses

    Exercise 5
    Use the cor() command to see the correlation of all the numeric and integer columns, including the class column. Remember that numbers close to 1 mean high correlation and numbers close to 0 mean low correlation. This will give you a rough idea for feature selection.

    Exercise 6
    Split the dataset into Train and Test samples. You may use sample.split() with a ratio of 0.7, and set the seed to 1000. Make sure to install and load the caTools package.
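    (If you get stuck, here is a minimal sketch of this step; it assumes the target column from the earlier exercises is named class.)

    library(caTools)
    set.seed(1000)
    split <- sample.split(df$class, SplitRatio = 0.7)  # 70/30 split
    Train <- subset(df, split == TRUE)
    Test  <- subset(df, split == FALSE)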

    Exercise 7
    Check the number of rows of Train
    Check the number of rows of Test

    Exercise 8
    We are ready to use a decision tree on our dataset. Load the packages "rpart" and "rpart.plot". If they are not installed, use the install.packages() command.

    Exercise 9
    Use rpart to build the decision tree on the Train set. Include all features. Store this model in dec.

    Exercise 10
    Use the prp() function to plot the decision tree. If you get any error, use this code before the prp() command:

    par(mar = rep(2, 4))

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Categories: Methodology Blogs

    An Interview With Jo Hardin, author of Foundations of Inference

    Thu, 2016-12-29 08:08

    (This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

    Hey R fans! A new episode of DataCamp’s DataChats video series is out! 

    In this episode, we interview Jo Hardin. Jo is a Professor of Mathematics at Pomona College with many years of R experience.  She has a pure passion for education and has been working on the ASA’s undergraduate curriculum guidelines where she strongly advocated the infusion of data science into the undergraduate statistics curriculum.

    Together with Nick, Jo talks about R’s place in the stats curriculum, the role of technology in education, what advice she would give to people just starting in statistics, bootstrapping, and much more.

    We hope that you enjoy watching this series and make sure you don’t miss any of our upcoming episodes by subscribing to DataCamp’s YouTube channel!

    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

    Categories: Methodology Blogs

    Reactive acronym list in stratvis, a timevis-based Shiny app

    Thu, 2016-12-29 06:40

    (This article was first published on data prone - R, and kindly contributed to R-bloggers)

    Abstract

    I present a method for reactively updating a table of acronyms from a Shiny interactive timeline using renderDataTable and timevis. The method is used in the new Shiny app, stratvis.

    The stratvis app

    The stratvis Shiny app provides a rich and fully interactive timeline
    visualization of hierarchical items (e.g. strategy, policy, guidance, goals,
    objectives, milestones, decisions) for a strategic view of organizational
    activity. The app uses the timevis R package, which is based on
    the vis.js Timeline module and the
    htmlwidgets R package. For convenience, I’ve hosted
    a demo of stratvis on
    my Shiny server, so you can scroll through the interactive
    timeline and watch the acronym list adjust automatically.



    Relevant background on the timevis API

    When a timeline widget is created in a Shiny app with the timevis method, four Shiny inputs
    are also created (and updated as the interactive timeline is manipulated within the app).
    The names of the inputs are based upon the name given to the timeline object (with _data,
    _ids, _selected, and _window appended). We will use the _data and _window appended
    objects in our
    method to build the reactive acronym list below.
    If the interactive timeline object is timelineGroups (as it
    is in the stratvis demo), then the following input variables are available:

    • input$timelineGroups_data – will return a data.frame containing the data
      of the items in the timeline. The input is updated every time an item is
      modified, added, or removed.

    • input$timelineGroups_window – will return a 2-element vector containing the minimum and maximum dates currently visible in the timeline. The input is updated every time the viewable window of dates is updated (by zooming or moving the window). A minimal standalone sketch of both inputs follows this list.
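    To make these two inputs concrete, here is a minimal standalone sketch (not the stratvis code) of a Shiny app with a timevis widget named timeline; the example items are made up.

    library(shiny)
    library(timevis)

    items <- data.frame(id = 1:2,
                        content = c("Item A", "Item B"),
                        start = c("2016-01-15", "2016-06-01"))

    ui <- fluidPage(
      timevisOutput("timeline"),
      verbatimTextOutput("window")
    )

    server <- function(input, output, session) {
      output$timeline <- renderTimevis(timevis(items))
      output$window <- renderPrint({
        req(input$timeline_window)   # updates on every zoom or pan
        input$timeline_window        # min and max visible dates
      })
    }

    shinyApp(ui, server)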

    Reactively update acronym list

    The block of code in server.R that generates the reactively updated acronym list:

    output$acronyms <- DT::renderDataTable({
      firstDate <- input$timelineGroups_window[1]
      lastDate <- input$timelineGroups_window[2]
      data <- input$timelineGroups_data %>%
        select(label, start, end) %>%
        filter((is.na(end) & (start > firstDate & start < lastDate)) |
               (!is.na(end) & (start < lastDate & end > firstDate)))
      acronyms %>%
        filter(grepl(
          paste(unlist(str_split(data$label, pattern = " ")), collapse = "|"),
          acronym)) %>%
        select(acronym, full)
    },
    options = list(
      paging = FALSE,
      order = list(list(1, 'asc')),
      rownames = FALSE,
      columnDefs = list(
        list(visible = FALSE, targets = 0)
      ),
      colnames = c("Acronym", "")
    )
    )

    Obviously, the core functionality is provided by the renderDataTable
    method, which makes a reactive version of a function that returns a data
    frame (or matrix) to be rendered with the DataTables library.
    As mentioned above, the timelineGroups object is the actual timevis interactive timeline
    object, and the following code ensures that the acronyms Shiny output is
    reactively updated whenever the timelineGroups_window Shiny
    input changes (i.e. when the minimum or maximum date in the visible timeline
    window changes):

    firstDate <- input$timelineGroups_window[1]
    lastDate <- input$timelineGroups_window[2]

    The label, start, and end variables within the timelineGroups_data Shiny
    input are selected and then filtered for only those items visible in the timeline:

    data <- input$timelineGroups_data %>%
      select(label, start, end) %>%
      filter((is.na(end) & (start > firstDate & start < lastDate)) |
             (!is.na(end) & (start < lastDate & end > firstDate)))

    The call to filter accommodates 'point' and 'box' type objects (type = 'point' or
    type = 'box') that do not have end dates (end = NA), as well as 'range' and 'background'
    type objects (type = 'range' or type = 'background'). Note that this approach
    would also update the acronym list if an item is added interactively to the timeline.

    Each row of the acronyms data frame is an acronym (acronym variable) and
    the full phrase to which it corresponds (full variable). The following code uses
    grepl to filter acronyms down to only those items present in the data object
    created above. The regular expression used in the call to grepl is simply
    all of the words (separated by white space) in data$label pasted together with
    “|” characters between them.

    acronyms %>%
      filter(grepl(
        paste(unlist(str_split(data$label, pattern = " ")), collapse = "|"),
        acronym)) %>%
      select(acronym, full)

    Feedback welcome

    If you have any feedback on the above approach or the stratvis
    app in general, please
    leave a comment below or use the Tweet button.
    As with any of my projects, feel free to fork the stratvis repo
    and submit a pull request if you wish to contribute.


    To leave a comment for the author, please follow the link and comment on their blog: data prone - R.

    Categories: Methodology Blogs

    The Instant Rise of Machine Intelligence?

    Wed, 2016-12-28 19:00

    (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

    Currently the news is filled with articles about the rise of machine intelligence, artificial intelligence and deep learning.
    To the average reader it seems that there was a single technical breakthrough that made AI possible. While I strongly believe in the fascinating opportunities around deep learning for image recognition, natural language processing and even end-to-end "intelligent" systems (e.g. chat bots), I wanted to get a better feeling for the recent technological progress.

    First I read about tensorflow (for R) and watched a number of great talks about it. Do not miss Nuts and Bolts of Applying Deep Learning (Andrew Ng) and Tensorflow and deep learning – without a PhD by Martin Görner. Second, I started to look at publications and error improvements on public datasets.
    There is surprisingly little information about the improvement rate of machine learning on public datasets. I found one great resource that I would like to analyse in the following post.
    All of the datasets ("MNIST", "CIFAR-10", "CIFAR-100", "STL-10", "SVHN") are image classification tasks, and the results are published in academic (peer-reviewed) outlets.
    In order to better aggregate the results, I report the trimmed (10 percent) mean error rate per year per dataset.
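    The aggregation itself is straightforward; a sketch of it is below, assuming a data frame results with columns dataset, year and error (these names are hypothetical).

    library(dplyr)
    library(ggplot2)

    yearly <- results %>%
      group_by(dataset, year) %>%
      summarise(trimmed_error = mean(error, trim = 0.1))  # 10% trimmed mean

    # one panel per dataset, each with its own scales
    ggplot(yearly, aes(year, trimmed_error)) +
      geom_line() +
      facet_wrap(~ dataset, scales = "free")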

    We see that the mean reported error drops in all datasets each year. Each panel has its own x- and y-scales; however, inspected closely, we see that there is no apparent drop in the error rate in one particular year. Rather, it seems that the improvement per dataset is a linear function of time.
    To get a better look at the best performers, let's do the same plot with just the lowest reported error rates per year.

    Again, there is not a single year that appears to mark the rise of the machines; rather, it looks like a continuous process.
    If it is a continuous process, let's quickly summarise the improvement rate per dataset.

    Dataset     Improvement   Years   PP. Improvement per Year
    CIFAR-10    3%            6       0.4%
    CIFAR-100   29%           5       5.9%
    MNIST       63%           13      4.8%
    STL-10      36%           5       7.1%
    SVHN        16%           6       2.6%

    The improvement column lists the percent improvement from the best publication in the first year to the current best publication. The datasets have been around for various timeframes (indicated in column 2). Finally, we get the percentage point increase per year. While the improvement varies, across the board it seems that about 5% improvement per year is reasonable.

    So if there is no single year that marks an instant spike in improvement, what is the hype about? I assume that, with the steady progress of recent years, AI seems to approach or even surpass human-level performance on some tasks. Basically, the news is not a technology breakthrough but rather the passing of an important threshold.

    In case you want to have a look at the data yourself:

    To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.

    Categories: Methodology Blogs

    Tip: Optimize your Rcpp loops

    Wed, 2016-12-28 19:00

    (This article was first published on Florian Privé, and kindly contributed to R-bloggers)

    In this post, I will show you how to optimize your Rcpp loops so that they are 2 to 3 times faster than a standard implementation.

    Context

    Real data example

    For this post, I will use a big.matrix which represents genotypes for 15,283 individuals, corresponding to the number of mutations (0, 1 or 2) at 287,155 different loci. Here, I will use only the first 10,000 loci (columns).

    What you need to know about the big.matrix format:

    • you can easily and quickly access matrix-like objects stored on disk,
    • you can use different types of storage (I use type char to store each element on only 1 byte),
    • it is column-major ordered as standard R matrices,
    • you can access elements of a big.matrix using X[i, j] in R,
    • you can access elements of a big.matrix using X[j][i] in Rcpp,
    • you can get a RcppEigen or RcppArmadillo view of a big.matrix (see Appendix).
    • for more details, go to the GitHub repo.
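    Here is that brief illustrative sketch (not taken from the post; the dimensions are made up):

    library(bigmemory)

    X <- big.matrix(nrow = 100, ncol = 10, type = "char", init = 0)
    X[1:5, 1:3] <- 1           # element access works like a regular R matrix
    print(X[1:5, 1:3])
    print(class(X@address))    # "externalptr", which is what the Rcpp functions receive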

    Peek at the data:

    print(dim(X))
    ## [1] 15283 10000
    print(X[1:10, 1:12])
    ##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
    ##  [1,]    2    0    2    0    2    2    2    1    2     2     2     2
    ##  [2,]    2    0    1    2    1    1    1    1    2     1     2     2
    ##  [3,]    2    0    2    2    2    2    1    1    2     1     2     2
    ##  [4,]    2    2    0    2    0    0    0    2    2     2     0     2
    ##  [5,]    2    1    2    2    2    2    2    1    2     2     2     2
    ##  [6,]    2    1    2    1    2    2    1    1    2     2     2     2
    ##  [7,]    2    0    2    0    2    2    2    0    2     1     2     2
    ##  [8,]    2    1    1    2    1    1    1    1    2     1     2     2
    ##  [9,]    2    1    2    2    2    2    2    2    2     2     2     2
    ## [10,]    2    0    2    1    2    2    2    0    2     1     2     1

    What I needed

    I needed a fast matrix-vector multiplication between a big.matrix and a vector. Moreover, I could not use any RcppEigen or RcppArmadillo multiplication because I needed the option of efficiently subsetting columns or rows of my matrix (see Appendix).

    Writing this multiplication in Rcpp is no more than two loops:

    // [[Rcpp::depends(RcppEigen, bigmemory, BH)]]
    #include <RcppEigen.h>
    #include <bigmemory/MatrixAccessor.hpp>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericVector prod1(XPtr<BigMatrix> bMPtr, const NumericVector& x) {

      MatrixAccessor<char> macc(*bMPtr);

      int n = bMPtr->nrow();
      int m = bMPtr->ncol();

      NumericVector res(n);
      int i, j;

      for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) {
          res[i] += macc[j][i] * x[j];
        }
      }

      return res;
    }

    One test:

    y <- rnorm(ncol(X))

    print(system.time(
      test <- prod1(X@address, y)
    ))
    ##    user  system elapsed
    ##   0.664   0.004   0.668

    What comes next should be transposable to other applications and other types of data.

    Unrolling optimization

    While searching for optimizing my multiplication, I came across this Stack Overflow answer.

    Unrolling in action:

    // [[Rcpp::depends(RcppEigen, bigmemory, BH)]]
    #include <RcppEigen.h>
    #include <bigmemory/MatrixAccessor.hpp>
    using namespace Rcpp;

    // [[Rcpp::export]]
    NumericVector prod4(XPtr<BigMatrix> bMPtr, const NumericVector& x) {

      MatrixAccessor<char> macc(*bMPtr);

      int n = bMPtr->nrow();
      int m = bMPtr->ncol();

      NumericVector res(n);
      int i, j;

      for (j = 0; j <= m - 4; j += 4) {
        for (i = 0; i < n; i++) { // unrolling optimization
          res[i] += (x[j] * macc[j][i] + x[j+1] * macc[j+1][i]) +
                    (x[j+2] * macc[j+2][i] + x[j+3] * macc[j+3][i]);
        } // The parentheses are somehow important. Try without.
      }
      for (; j < m; j++) {
        for (i = 0; i < n; i++) {
          res[i] += x[j] * macc[j][i];
        }
      }

      return res;
    }

    require(microbenchmark)

    print(microbenchmark(
      PROD1 = test1 <- prod1(X@address, y),
      PROD4 = test2 <- prod4(X@address, y),
      times = 5
    ))
    ## Unit: milliseconds
    ##  expr      min       lq     mean   median       uq      max neval
    ## PROD1 609.0916 612.6428 613.7418 613.3740 616.4907 617.1096     5
    ## PROD4 262.2658 267.7352 267.0268 268.0026 268.0785 269.0521     5

    print(all.equal(test1, test2))
    ## [1] TRUE

    Nice! Let's try more. Why not use 8 or 16 rather than 4?

    Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods.cpp')

    print(bench <- microbenchmark(
      PROD1 = prod1(X@address, y),
      PROD2 = prod2(X@address, y),
      PROD4 = prod4(X@address, y),
      PROD8 = prod8(X@address, y),
      PROD16 = prod16(X@address, y),
      times = 5
    ))
    ## Unit: milliseconds
    ##   expr      min       lq     mean   median       uq      max neval
    ##  PROD1 620.9375 627.9209 640.6087 631.1818 659.4236 663.5798     5
    ##  PROD2 407.6275 418.1752 417.1746 418.4589 419.0665 422.5451     5
    ##  PROD4 267.1687 271.4726 283.1928 271.9553 279.6698 325.6979     5
    ##  PROD8 241.5542 242.9120 255.4974 246.5218 267.7683 278.7307     5
    ## PROD16 212.4335 213.5228 217.4781 217.1801 221.5119 222.7423     5

    time <- summary(bench)[, "median"]
    step <- 2^(0:4)
    plot(step, time, type = "b", xaxt = "n", yaxt = "n",
         xlab = "size of each step")
    axis(side = 1, at = step)
    axis(side = 2, at = round(time))

    Conclusion

    We have seen that unrolling can dramatically improve the performance of loops. Steps of size 8 or 16 give relatively little extra gain compared to 2 or 4.

    As pointed out in the SO answer, it can behave rather differently between systems. So, if it is for your personal use, use the maximum gain (try 32!), but as I want my function to be used by others in a package, I think it’s safer to choose a step of 4.

    Appendix

    You can do a big.matrix-vector multiplication easily with RcppEigen or RcppArmadillo (see this code), but they lack an efficient subsetting option.

    Indeed, you still can't use subsetting in Eigen, but this will come, as noted in this feature request. For Armadillo, you can, but it is rather slow:

    Rcpp::sourceCpp('https://privefl.github.io/blog/code/prods2.cpp')

    n <- nrow(X)
    ind <- sort(sample(n, size = n/2))

    print(microbenchmark(
      EIGEN = test3 <- prodEigen(X@address, y),
      ARMA = test4 <- prodArma(X@address, y),
      ARMA_SUB = test5 <- prodArmaSub(X@address, y, ind - 1),
      times = 5
    ))
    ## Unit: milliseconds
    ##      expr       min        lq      mean    median        uq       max neval
    ##     EIGEN  567.5607  570.1843  717.2433  572.9402  576.2028 1299.3285     5
    ##      ARMA 1242.3581 1263.8803 1329.1212 1264.7070 1284.5612 1590.0993     5
    ##  ARMA_SUB  455.1174  457.5862  466.3982  461.5883  465.9056  491.7935     5

    print(all(
      all.equal(test3, test),
      all.equal(as.numeric(test4), test),
      all.equal(as.numeric(test5), test[ind])
    ))
    ## [1] TRUE

    To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.

    Categories: Methodology Blogs

    Combine choropleth data with raster maps using R

    Wed, 2016-12-28 15:44

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    Switzerland is a country with lots of mountains, and several large lakes. While the political subdivisions (called municipalities) cover the high mountains and lakes, nothing much of economic interest happens in these places. (Raclette and sailing are wonderful, but don't count for our purposes.) For this reason, the Swiss Federal Statistical Office publishes the boundaries of the "productive" parts of the municipalities, and as this choropleth of average age in Swiss municipalities created by Timo Grossenbacher shows, leaving out the non-productive parts leaves us with a very different-looking Switzerland.

    The choropleth would be more recognizable by filling in the non-productive areas with a traditional relief map, which is exactly what Timo does (along with breaking the age scale into discrete categories, for improved interpretability) in the publication-quality map below.

    Timo's blog post, Beautiful thematic maps with ggplot2 (only), details the process of building maps like this using the ggplot2 package (and just a few others) for R. There are lots of useful nuggets of advice within the tutorial (a rough sketch combining several of them follows the list), including:

    • To run scripts in a "clean" R session, to avoid conflicts with packages and objects that happen to be hanging around. Timo suggests unloading packages and removing objects, but a quicker and easier way is to simply launch a new R session with "R --vanilla". The --vanilla option prevents R from running any initialization scripts (that might load packages) or loading any objects from a saved workspace.
    • To import geographic boundaries using the readOGR function, and the use of coord_equal to display them as a map without distortion.
    • To choose a useful color scale for continuous variables (like age) with the viridis package, and how to discretize them into buckets to improve visibility of regional differences.
    • To define a ggplot2 theme according to your presentation style guide (here, a light grey background, a specific font, and no grid lines).
    • Importing a TIFF relief map of Switzerland using the raster function, and overlaying part of it onto the choropleth by the clever trick of making the non-mountainous parts transparent.

    For the complete tutorial, including links to the code and data, check out Timo's blog post linked below.

    Timo Grossenbacher: Beautiful thematic maps with ggplot2 (only)

    To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

    Categories: Methodology Blogs

    Exploratory Data Analysis Using R (Part-I)

    Wed, 2016-12-28 13:19

    (This article was first published on R Language in Datazar on Medium, and kindly contributed to R-bloggers)

    The greatest value of a picture is when it forces us to notice what we never expected to see. — John W. Tukey. Exploratory Data Analysis.

    Why do we use exploratory graphs in data analysis?
    • Understand data properties
    • Find patterns in data
    • Suggest modeling strategies
    • “Debug” analyses

    Data – We will use the airquality dataset available in R for our analysis. The entire project can be found here. You can go and try it for yourself by running it on Datazar.

    library(datasets)
    head(airquality)

    Summaries of Data

    One-dimensional Data – Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.

    When we are dealing with a single variable, let's say temperature, wind speed, or age, the following techniques are used for the initial exploratory data analysis.

    • Five-number summary – This essentially provides information about the minimum value, 1st quartile, median, 3rd quartile, and the maximum.
    summary(airquality$Wind)
    Summary Of Windspeed
    • Boxplots – A boxplot consists of a rectangular box bounded above and below by "hinges" that represent the quartiles Q3 and Q1 respectively, with a horizontal "median" line through it. You can also see the upper and lower "whiskers", and a point marking a potential "outlier".

    IQR (interquartile range) = Q3 − Q1 (the box in the plot)

    whiskers = ±1.58 × IQR/√n, where n is the number of samples (data points).

    boxplot(airquality$Wind ~ airquality$Month, col = "purple")
    Wind Speed by Month
    • Histograms- The most basic graph is the histogram, which is a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Typically the bars run vertically with the count (or proportion) axis running vertically. To manually construct a histogram, define the range of data for each bar (called a bin), count how many cases fall in each bin, and draw the bars high enough to indicate the count.
    hist(airquality$Wind, col = "gold")
    rug(airquality$Wind)  # (optional) plots the points below the histogram
    • Barplot- A bar chart is made up of columns or rows plotted on a graph. Here is how to read a bar chart made up of columns.
    • The columns are positioned over a label that represents a categorical variable.
    • The height of the column indicates the size of the group defined by the column label.
    • A bar chart is used when you have categories of data: types of movies, music genres, or dog breeds. Hence, a bar chart (and not a histogram) is used when we are dealing with categorical variables.
    barplot(table(chickwts$feed), col = "wheat", main = "Number Of Chickens by diet type")

    Two dimensional Data– Multivariate non-graphical EDA techniques generally show the relationship between two or more variables in the form of either cross-tabulation or statistics.

    • Scatter Plot – The basic graphical EDA technique for visualizing the relationship between two quantitative variables.

    For two quantitative variables, the basic graphical EDA technique is the scatterplot which has one variable on the x-axis, one on the y-axis and a point for each case in your dataset. If one variable is explanatory and the other is outcome, it is a very, very strong convention to put the outcome on the y (vertical) axis.

    One or two additional categorical variables can be accommodated on the scatterplot by encoding the additional information in the symbol type and/or color.

    We will use the Males.csv dataset (present in the project on Datazar) to check whether being part of a union impacts the salaries of young American males.

    males <- read.csv("dataset0.csv")
    head(males)
    samplemales <- males[1:100, ]  # we used the first 100 rows
    with(samplemales, plot(exper, wage, col = union))
    # union is a categorical variable represented by color
    Scatter plot of wage vs. experience (the color represents whether the employee is part of a union)

    We can also use multiple scatter plots to better understand whether being part of a union impacts an employee's salary.
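    One quick way to do that (a sketch, not from the original post) is a conditioning plot, which draws one panel per union status from the same samplemales data frame:

    # one scatter plot panel per level of the union variable
    coplot(wage ~ exper | union, data = samplemales)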

    We can see that most employees are not part of a union, and they tend to earn more than employees who are part of a union. Correlation doesn't always mean causation; it might be the case that high-paying industries do not allow their employees to form unions.

    In a nutshell: you should always perform appropriate EDA before further analysis of your data.

    Lastly, I wish you all a merry Christmas and a very happy new year. I will come back with the next edition of EDA in New Year. Till then, happy modeling!


    Exploratory Data Analysis Using R (Part-I) was originally published in Datazar on Medium, where people are continuing the conversation by highlighting and responding to this story.

    To leave a comment for the author, please follow the link and comment on their blog: R Language in Datazar on Medium.

    Categories: Methodology Blogs

    Celebrating our 100th R exercise set

    Wed, 2016-12-28 12:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Yesterday we published our 100th set of exercises on R-exercises. Kudos and many thanks to Avi, Maria Elisa, Euthymios, Francisco, Imtiaz, John, Karolis, Mary Anne, Matteo, Miodrag, Paritosh, Sammy, Siva, Vasileios, and Walter for contributing so much great material to practice R programming! Even more thanks to Onno, who is working (largely) behind the scenes to get everything working smoothly.

    I thought perhaps this would be a good time to share some thoughts on the ideas behind the site, and how to proceed from this point onward. The main idea is pretty simple: it helps to practice if you want to learn R programming.

    The two problems we’re trying to solve

    Although the idea itself is simple, for many people, and perhaps you as well, following up on this idea is a challenge. For example, practicing R programming requires a certain task that has to be completed, a solution to an analytical problem that has to be found, or a broader goal definition. Without this, we would just be typing random R syntax, or copy-pasting code we found somewhere on the web, which will contribute little to improving our R skills. The main problem R-exercises is trying to solve is how to specify these tasks, problems and goals in a useful, creative and structured way. The exercise sets are our (current) solution to this problem.

    But there's a second challenge for those who want to practice: staying focused. Life throws many distractions at us, and while you perhaps found some interesting problems to practice your R skills on, sooner or later practicing fades away when more urgent matters pop up. So, the second problem R-exercises is trying to solve is how to practice in a focused, persistent way. Offering new exercises on a daily basis, rather than one-time communication (e.g. a book or course), is our solution to this second problem.

    Filling a gap in existing solutions

    Is there a need for a site filled with exercises? There is an enormous amount of educational material on R available already. Our Course Finder directory includes 140 R courses, offered on 17 different platforms. Many universities teach R as part of their methods/statistics course programs. There are plenty of books on R. A search for "tutorial" on the blog aggregator R-bloggers reveals 1783 articles. And then there's YouTube. It seems, with so much material, gaps are unlikely. But are they?

    Going back to the two challenges we just described, we think what we're offering is complementary to courses, books, classes and tutorials, because the focus of most courses, books, classes and tutorials is on explaining and demonstrating things rather than on practicing (the first challenge). And their focus is temporary, not necessarily persistent (the second challenge): it's gone after you've completed the course, read the book or watched the video tutorial.

    In their excellent book “Make it stick”, Roediger and McDaniel explain that many of our intuitive approaches to learning (e.g. rereading a text) are unproductive. Instead they advise: “One of the best habits a learner can instill in herself is regular self-quizzing to recalibrate her understanding of what she does and does not know.” From this perspective, R-exercises can help you to recalibrate your understanding of what you know and don’t know about R.

    The next 100 sets

    We’re committed to keep expanding R-exercises, and adding more exercise sets. A while ago we started to differentiate sets in terms of difficulty (beginner, intermediate and advanced), an idea that many readers seemed to like when we proposed it. Recently we started to include information about online courses directly related to the exercises in a set, so for those who want to learn more, it’s easy to find a relevant course quickly.

    Another idea we have is to offer premium (paid) memberships, with access to more extensive learning materials related to each exercise set. We’d actually love to hear your suggestions on how we can improve and expand R-exercises. What would you like to see on the site in 2017?

    To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Authentication Proxy on Shiny Open Source

    Wed, 2016-12-28 10:02

    (This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers)

    A year ago I wrote about a way to authenticate Shiny with Auth0, using Apache: http://blog.datascienceheroes.com/adding-authentication-to-shiny-open-source-edition/

    That method works but has some issues. Sebastian Peyrott has written an excellent new blog post that explains how to add authentication to the Open Source edition of Shiny from scratch, using a Node.js proxy and Nginx.

    With this you'll be able to serve internal reports without resorting to an expensive solution or building everything yourself. You can read the blog post at:
    https://auth0.com/blog/adding-authentication-to-shiny-server/

    To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    R code to accompany Real-World Machine Learning (Chapters 2-4 Updates)

    Wed, 2016-12-28 01:00

    (This article was first published on data prone - R, and kindly contributed to R-bloggers)

    Abstract

    I updated the R code accompanying Chapters 2-4 of the book "Real-World Machine Learning" by Henrik Brink, Joseph W. Richards, and Mark Fetherolf to be more consistent with the listings and figures as presented in the book.

    rwml-R Chapters 2-4 updated

    The most notable changes to rwml-R are for Chapter 4, where multiple ROC curves are plotted for a 10-class classifier and a tile plot is generated for a tuning parameter grid search. Also, for parallel computations, the doMC package was replaced with doParallel.

    Plotting a series of ROC curves

    To be consistent with the approach followed in the book, I've added listings of R code to compute the ROC curves and AUC values "from scratch" instead of using the ROCR package as was done previously:
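    The full listings live in the rwml-R repository; the following is only a minimal illustration of the idea (not the book's or the repo's exact code), assuming a vector of predicted scores and 0/1 class labels:

    # Sweep a threshold over the predicted scores and record (FPR, TPR) pairs
    roc_points <- function(scores, labels, n = 100) {
      thresholds <- quantile(scores, probs = seq(0, 1, length.out = n))
      t(sapply(thresholds, function(th) {
        pred <- scores >= th
        c(fpr = sum(pred & labels == 0) / sum(labels == 0),
          tpr = sum(pred & labels == 1) / sum(labels == 1))
      }))
    }

    # Trapezoidal-rule AUC from the (FPR, TPR) points
    # (approximate: the endpoints are not forced to (0,0) and (1,1))
    auc_trapezoid <- function(roc) {
      roc <- roc[order(roc[, "fpr"]), ]
      sum(diff(roc[, "fpr"]) * (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)
    }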

    Tuning model parameters in Chapter 4

    The caret package is used to tune parameters via grid search for the Support Vector Machines model with a Radial Basis Function Kernel. By setting summaryFunction = twoClassSummary in trainControl, the ROC curve is used to select the optimal model. For consistency with the book, tile plots were added to illustrate the process of refining the grid for the parameter search. The tile plot for the second (refined) grid search is below.
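    For reference, a minimal caret setup along these lines might look as follows; the data frame train_df, its two-class factor outcome Class, and the parameter values are illustrative placeholders, not the values used in rwml-R:

    library(caret)
    library(kernlab)  # backend for method = "svmRadial"

    # Cross-validation control that selects models by area under the ROC curve
    ctrl <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

    # Candidate values for the RBF kernel width (sigma) and cost (C)
    grid <- expand.grid(sigma = c(0.01, 0.05, 0.1),
                        C = c(0.5, 1, 2, 4))

    svm_fit <- train(Class ~ ., data = train_df,
                     method = "svmRadial",
                     metric = "ROC",
                     tuneGrid = grid,
                     trControl = ctrl)

    # plot(svm_fit) visualizes performance across the (sigma, C) grid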

    Feedback welcome

    If you have any feedback on the rwml-R project, please leave a comment below or use the Tweet button. As with any of my projects, feel free to fork the rwml-R repo and submit a pull request if you wish to contribute. For convenience, I've created a project page for rwml-R with the generated HTML files from knitr.


    To leave a comment for the author, please follow the link and comment on their blog: data prone - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Behind the scenes of CRAN

    Tue, 2016-12-27 22:10

    (This article was first published on R-Bloggers – H2O.ai Blog, and kindly contributed to R-bloggers)

    (Just from my point of view as a package maintainer.) New users of R might not appreciate the full benefit of CRAN, and new package maintainers may not appreciate the importance of keeping their packages updated and free of warnings and errors. This is something I only came to realize myself in the last few years […]

    To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – H2O.ai Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    More on Orthogonal Regression

    Tue, 2016-12-27 17:28

    (This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)

    Some time ago I wrote a post about orthogonal regression. This is where we fit a regression line so that we minimize the sum of the squares of the orthogonal (rather than vertical) distances from the data points to the regression line. Subsequently, I received the following email comment:

    “Thanks for this blog post. I enjoyed reading it. I’m wondering how straightforward you think this would be to extend orthogonal regression to the case of two independent variables? Assume both independent variables are meaningfully measured in the same units.”

    Well, we don’t have to make the latter assumption about units in order to answer this question. And we don’t have to limit ourselves to just two regressors. Let’s suppose that we have p of them. In fact, I hint at the answer to the question posed above towards the end of my earlier post, when I say, “Finally, it will come as no surprise to hear that there’s a close connection between orthogonal least squares and principal components analysis.”

    What was I referring to, exactly?
    Well, just recall how we define the Principal Components of a multivariate set of data. Suppose that the data are in the form of an (n x p) matrix, X. There are n observations, and p variables. An orthogonal transformation is applied to X. This results in r (≤ p) new variables that are linearly uncorrelated. These are the principal components (PC's) of the data, and they are ordered as follows. The first PC accounts for most of the variability in the original data. The second PC accounts for the maximum amount of the remaining variability in the data, subject to the constraint that it is uncorrelated with (i.e., orthogonal to) the first PC.

    Note how orthogonality has crept into the story!

    We then continue: the third PC accounts for the maximum amount of the remaining variability in the data, subject to the constraint that it is orthogonal to both the first and second PC's, and so on.

    You’ll find examples of PC analysis being used in a statistically descriptive way in some earlier posts of mine – e.g., here and here.

    We can use (some of) the PC’s of the regressor data as explanatory variables in a regression model. A useful reference for this can be found here. Note that, by construction, these transformed explanatory variables will have zero multicollinearity.

    So, in the multivariate case, orthogonal regression is just least squares regression using a sub-set of the principal components of the original regressor matrix as the explanatory variables. We also sometimes call it Total Least Squares.
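    For concreteness, here is a minimal principal components regression sketch in base R; the data are simulated and the choice of two retained components is purely illustrative (this is not code from the original post):

    # Principal components regression: least squares on the first k PC's of X
    set.seed(123)
    n <- 100; p <- 5
    X <- matrix(rnorm(n * p), n, p)                # (n x p) regressor matrix
    y <- as.vector(X %*% runif(p) + rnorm(n))

    pc <- prcomp(X, center = TRUE, scale. = TRUE)  # orthogonal transformation of X
    k <- 2                                         # keep the first k components
    dat <- data.frame(y = y, pc$x[, 1:k])          # PC1 and PC2 as regressors

    fit <- lm(y ~ ., data = dat)                   # zero multicollinearity by construction
    summary(fit)

    The pls package offers a fuller implementation of the same idea through its pcr() function.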

    In this earlier post I talked about using Principal Components Regression (PCR) in the context of simultaneous equations models. The problem there was that we can’t construct the 2SLS estimator if the sample size is smaller than the total number of predetermined variables in the entire system. (This used to be referred to as the “under-sized sample” problem.) One solution was to use a few of the principal components of the matrix of data on the predetermined variables, instead of all of the latter variables, at the first stage of 2SLS. (Usually, just the first few principal components will capture almost all of the variability in the original data.)

    There are some useful discussions of this that you might want to refer to. For instance, Vincent Zoonekynd has a nice illustration here. I particularly recommend two other pieces that discuss PCR using R – this post, “Principal components regression in R, an operational tutorial”, by John Mount, on the Revolutions blog; and this post, “Performing principal components regression (PCR) in R”, by Michy Alice, on the Quantide site.

    PCR also gets a brief mention in this earlier post of mine – see the discussion of the last paper mentioned in that post. So, the bottom line is that while my introductory post dealt with just the single-regressor case, it's straightforward to apply orthogonal multiple regression – it's just regression using the first few principal components of the regressor matrix.

    © 2016, David E. Giles

    To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    R For Beginners: Some Simple R Code to do Common Statistical Procedures, Part Two

    Tue, 2016-12-27 17:05

    (This article was first published on r – R Statistics and Programming, and kindly contributed to R-bloggers)

    An R tutorial by D. M. Wiig

    This posting contains an embedded Word document. To view the document full screen click on the icon in the lower right hand corner of the embedded document.


    To leave a comment for the author, please follow the link and comment on their blog: r – R Statistics and Programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs

    Analyzing the 2015 California Health Interview Survey in R

    Tue, 2016-12-27 14:04

    (This article was first published on quantitate, and kindly contributed to R-bloggers)


    A few years ago, I wrote about how to analyze the 2012 California Health Interview Survey in R. In 2012, plans for Covered California (Obamacare in California) were just beginning to take shape. Today, Covered California is a relatively mature program and it is arguably the most successful implementation of the Affordable Care Act in the United States. This month, the UCLA Center for Health Policy Research released the 2015 California Health Interview Survey (CHIS for short). With this fantastic new data set, we can measure the impact of Covered California in its second year. In this brief post, I'll review the basics of working with CHIS data in R by way of a simple example. My hope is to inspire other R users to dive into this unique data set.

    The CHIS Quickstart guide for R

    Though CHIS is a complex survey, it’s simple to work with CHIS data in R. Here’s how to get started:

    • Head over to the CHIS site and create an account.
    • Once you’ve created a free account and accepted a bunch of terms of use agreements, you’ll be able to download the CHIS public use data files. You’ll want to download the Stata .dta versions, as these are the easiest to work with using R’s foreign package.
    • CHIS data is divided into 3 groups, child, adolescent, and adult. We’ll work with the adult data below.
    • You’ll also want to download the appropriate data dictionary for your data set. The dictionary provides excellent documentation about the hundreds of variables covered by CHIS. If it’s your first time working with CHIS, I recommend a quick skim of the entire dictionary to get a sense of the kinds of things covered by the survey.

    Once you’ve downloaded the data, to bring it into R you can use the foreign package:
    # Read CHIS file
    library(foreign)
    file <- "~/projects/CHIS/chis15_adult_stata/Data/ADULT.dta" # your file
    CHIS <- read.dta(file, convert.factors = TRUE)

    The most important thing to understand about CHIS data is how to use the replicate weights RAKEDW0-RAKEDW80. I covered the use of replicate weights in detail in this post. The important points about replicate weights in CHIS are:

    • Use RAKEDW0 for estimating means and counts in CHIS. RAKEDW0 is designed so that its sum across all rows in the CHIS data equals the total non-institutionalized adult population of California.
    • Use RAKEDW1-RAKEDW80 for estimating variances, as described here; a rough sketch of setting this up with the survey package follows this list.
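    This is only an illustrative sketch (not from the original post) of declaring a replicate-weight design with the survey package; check the CHIS documentation for the exact type, scale and rscales settings:

    # Declare the CHIS replicate-weight design; the settings here are illustrative
    library(survey)
    chis_design <- svrepdesign(
      data       = CHIS,
      weights    = ~rakedw0,        # main analysis weight
      repweights = "rakedw[1-9]",   # regular expression matching rakedw1-rakedw80
      type       = "other",
      scale = 1, rscales = 1, mse = TRUE
    )
    svymean(~instype, chis_design)  # weighted proportions with standard errors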

    As an example, let's start by getting counts of health insurance coverage by type. For this we have two insurance type variables: INSTYPE, and the new INS9TP, which gives a more detailed breakdown of insurance types.
    # tabulate the data
    print(as.data.frame(xtabs(rakedw0~instype, CHIS, drop.unused.levels = TRUE)))
    # instype Freq
    # 1 UNINSURED 2910380.5
    # 2 MEDICARE & MEDICAID 1561496.7
    # 3 MEDICARE & OTHERS 1646743.6
    # 4 MEDICARE ONLY 2129841.7
    # 5 MEDICAID 6239539.9
    # 6 EMPLOYMENT-BASED 12193686.7
    # 7 PRIVATELY PURCHASED 1985807.9
    # 8 OTHER PUBLIC 415154.8

    One interesting health behavior that CHIS tracks is fast food consumption: the variable AC31 records the number of times respondents reported eating fast food in the past week. This simple script explores how fast food consumption interacts with health insurance coverage type:
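    The original post embeds the full script as a gist; the following is only a minimal sketch of the same idea, assuming ac31 is read in as a numeric count with no missing values (variable names in the Stata file are lower case, as above):

    # Weighted mean fast-food meals per week by insurance type, using RAKEDW0
    library(dplyr)
    fastfood <- CHIS %>%
      group_by(instype) %>%
      summarise(
        meals_per_week = sum(rakedw0 * ac31) / sum(rakedw0),  # weighted mean of ac31
        adults         = sum(rakedw0)                         # estimated adult population
      ) %>%
      arrange(desc(meals_per_week))
    print(fastfood)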

    Already with this superficial analysis we can see some interesting things. First, we notice that the uninsured eat fast food more often than the non-Medicaid insured. The uninsured's fast food behavior looks quite similar to that of the Medicaid population, while the fast food behavior of the employment-based insured resembles that of the private purchase group. And most importantly, everyone is eating too much fast food.

    Conclusion

    I hope this simple example inspires you to investigate CHIS data on your own. I think it would be especially interesting to see some further analysis of the nearly 3 million Californians who remain uninsured despite the relative success of Covered California. Some interesting background research on this topic can be found here and here. Feel free to get in touch if you are working with CHIS data to improve public health in California.

    To leave a comment for the author, please follow the link and comment on their blog: quantitate. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Categories: Methodology Blogs