R bloggers

R news and tutorials contributed by (573) R bloggers

Simulating Continuous-Time Markov Chains with simmer (part 2)

Mon, 2016-04-25 03:00

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)


In part one, we simulated a simple CTMC. Now, let us complicate things a bit. Remember the example problem there:

A gas station has a single pump and no space for vehicles to wait (if a vehicle arrives and the pump is not available, it leaves). Vehicles arrive at the gas station following a Poisson process with a rate of $\lambda=3/20$ vehicles per minute, of which 75% are cars and 25% are motorcycles. The refuelling time can be modelled with an exponential random variable with mean 8 minutes for cars and 3 minutes for motorcycles, that is, the service rates are $\mu_\mathrm{c}=1/8$ cars and $\mu_\mathrm{m}=1/3$ motorcycles per minute respectively (note that, in this context, $\mu$ is a rate, not a mean).

Consider the previous example, but, this time, there is space for one motorcycle to wait while the pump is being used by another vehicle. In other words, cars see a queue size of 0 and motorcycles see a queue size of 1.

The new Markov chain is the following:

(Diagram: a chain with five states, car+, car, empty, m/cycle and m/c+, and the following transitions, where $p$ in the rates is the probability that an arriving vehicle is a car.)

  • empty → car at rate $p\lambda$
  • empty → m/cycle at rate $(1-p)\lambda$
  • car → empty at rate $\mu_\mathrm{c}$
  • car → car+ at rate $(1-p)\lambda$
  • car+ → m/cycle at rate $\mu_\mathrm{c}$
  • m/cycle → empty at rate $\mu_\mathrm{m}$
  • m/cycle → m/c+ at rate $(1-p)\lambda$
  • m/c+ → m/cycle at rate $\mu_\mathrm{m}$

where the states car+ and m/c+ represent car + waiting motorcycle and motorcycle + waiting motorcycle respectively.

With $p$ the steady state distribution (states ordered as car+, car, empty, m/cycle, m/c+), the average number of vehicles in the system is given by

$$N = 2(p_1 + p_5) + p_2 + p_4$$

# Arrival rate
lambda <- 3/20
# Service rate (cars, motorcycles)
mu <- c(1/8, 1/3)
# Probability of car
p <- 0.75

# Theoretical resolution
# (rows: car+, car, empty, m/cycle, m/c+; the first column holds the normalisation)
A <- matrix(c(1,  0,                    0,       mu[1],                0,
              1, -(1-p)*lambda-mu[1],   mu[1],   0,                    0,
              1,  p*lambda,            -lambda,  (1-p)*lambda,         0,
              1,  0,                    mu[2],  -(1-p)*lambda-mu[2],   (1-p)*lambda,
              1,  0,                    0,       mu[2],               -mu[2]),
            byrow=TRUE, ncol=5)
B <- c(1, 0, 0, 0, 0)
P <- solve(t(A), B)
N_average_theor <- sum(P * c(2, 1, 0, 1, 2)) ; N_average_theor
## [1] 0.6349615
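As a cross-check on the linear system above, the same stationary distribution can be obtained from the full generator matrix of the chain. The sketch below is only an illustration in base R: the names `states`, `Q`, `M`, `pi_hat` and `N_check` are not from the original post, and the state order (car+, car, empty, m/cycle, m/c+) is assumed from the weights `c(2, 1, 0, 1, 2)`.

```r
lambda <- 3/20          # arrival rate
mu <- c(1/8, 1/3)       # service rates (cars, motorcycles)
p <- 0.75               # probability that an arrival is a car

states <- c("car+", "car", "empty", "m/cycle", "m/c+")
Q <- matrix(0, 5, 5, dimnames = list(states, states))

# Off-diagonal entries: one per arrow of the diagram
Q["car+",    "m/cycle"] <- mu[1]
Q["car",     "empty"]   <- mu[1]
Q["car",     "car+"]    <- (1 - p) * lambda
Q["empty",   "car"]     <- p * lambda
Q["empty",   "m/cycle"] <- (1 - p) * lambda
Q["m/cycle", "empty"]   <- mu[2]
Q["m/cycle", "m/c+"]    <- (1 - p) * lambda
Q["m/c+",    "m/cycle"] <- mu[2]

# Each row of a generator matrix sums to zero
diag(Q) <- -rowSums(Q)

# Solve pi %*% Q = 0 subject to sum(pi) = 1:
# transpose, then replace one balance equation with the normalisation
M <- t(Q)
M[1, ] <- 1
pi_hat <- solve(M, c(1, 0, 0, 0, 0))
names(pi_hat) <- states

# Average number of vehicles in the system
N_check <- sum(pi_hat * c(2, 1, 0, 1, 2))
N_check
```

Writing the generator explicitly makes it easier to compare each rate against the diagram. Replacing one balance equation with the normalisation is the same device that the column of ones in `A` implements.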

As in the previous post, we can simulate this chain by breaking down the problem into two trajectories (one for each type of vehicle and service rate) and two generators. But in order to prevent cars from waiting in the pump’s queue, we need a little trick in the cars’ seize: the argument amount is a function that returns 1 if the pump is vacant and 2 otherwise. When the pump is busy, the car is therefore rejected, because the queue has only one position and that seize requests two. Note also that the environment env must be defined before running the simulation, as it is needed inside the trajectory.

library(simmer)

set.seed(1234)

option.1 <- function(t) {
  car <- create_trajectory() %>%
    seize("pump", amount=function() {
      if (env %>% get_server_count("pump")) 2  # rejection
      else 1                                   # serve
    }) %>%
    timeout(function() rexp(1, mu[1])) %>%
    release("pump", amount=1)

  mcycle <- create_trajectory() %>%
    seize("pump", amount=1) %>%
    timeout(function() rexp(1, mu[2])) %>%
    release("pump", amount=1)

  env <- simmer() %>%
    add_resource("pump", capacity=1, queue_size=1) %>%
    add_generator("car", car, function() rexp(1, p*lambda)) %>%
    add_generator("mcycle", mcycle, function() rexp(1, (1-p)*lambda))
  env %>% run(until=t)
}

The same idea using a branch, with a single generator and a single trajectory.

option.2 <- function(t) {
  vehicle <- create_trajectory() %>%
    branch(function() sample(c(1, 2), 1, prob=c(p, 1-p)), c(FALSE, FALSE),
           create_trajectory("car") %>%
             seize("pump", amount=function() {
               if (env %>% get_server_count("pump")) 2  # rejection
               else 1                                   # serve
             }) %>%
             timeout(function() rexp(1, mu[1])) %>%
             release("pump", amount=1),                 # always 1
           create_trajectory("mcycle") %>%
             seize("pump", amount=1) %>%
             timeout(function() rexp(1, mu[2])) %>%
             release("pump", amount=1))

  env <- simmer() %>%
    add_resource("pump", capacity=1, queue_size=1) %>%
    add_generator("vehicle", vehicle, function() rexp(1, lambda))
  env %>% run(until=t)
}

We may also avoid messing up things with branches and subtrajectories. We can decide the type of vehicle and set it as an attribute of the arrival with set_attribute. Then, every activity’s function is able to retrieve those attributes as a named list. Although the branch option is a little bit faster, this one is nicer, because there are no subtrajectories involved.

option.3 <- function(t) {
  vehicle <- create_trajectory("car") %>%
    set_attribute("vehicle", function() sample(c(1, 2), 1, prob=c(p, 1-p))) %>%
    seize("pump", amount=function(attrs) {
      if (attrs["vehicle"] == 1 &&
          env %>% get_server_count("pump")) 2    # car rejection
      else 1                                     # serve
    }) %>%
    timeout(function(attrs) rexp(1, mu[attrs["vehicle"]])) %>%
    release("pump", amount=1)                    # always 1

  env <- simmer() %>%
    add_resource("pump", capacity=1, queue_size=1) %>%
    add_generator("vehicle", vehicle, function() rexp(1, lambda))
  env %>% run(until=t)
}

But if performance is a requirement, we can play cleverly with the resource’s capacity and queue size, and with the amounts requested in each seize, in order to model the problem without checking the status of the resource. Think about this:

  • A resource with capacity=3 and queue_size=2.
  • A car always tries to seize amount=3.
  • A motorcycle always tries to seize amount=2.

In these conditions, we have the following possibilities:

  • Pump empty.
  • One car (3 units) in the server [and optionally one motorcycle (2 units) in the queue].
  • One motorcycle (2 units) in the server [and optionally one motorcycle (2 units) in the queue].

Just as expected! So, let’s try:

option.4 <- function(t) {
  vehicle <- create_trajectory() %>%
    branch(function() sample(c(1, 2), 1, prob=c(p, 1-p)), c(FALSE, FALSE),
           create_trajectory("car") %>%
             seize("pump", amount=3) %>%
             timeout(function() rexp(1, mu[1])) %>%
             release("pump", amount=3),
           create_trajectory("mcycle") %>%
             seize("pump", amount=2) %>%
             timeout(function() rexp(1, mu[2])) %>%
             release("pump", amount=2))

  simmer() %>%
    add_resource("pump", capacity=3, queue_size=2) %>%
    add_generator("vehicle", vehicle, function() rexp(1, lambda)) %>%
    run(until=t)
}
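The choice of capacity=3, queue_size=2 and the amounts 3 and 2 can be sanity-checked by brute force. The following base R sketch is only a static capacity check (it does not run simmer, and the names `amount`, `combos`, `in_server` and `in_queue` are illustrative): it lists which groups of vehicles fit in the server and which fit in the queue.

```r
capacity   <- 3
queue_size <- 2
amount <- c(car = 3, mcycle = 2)   # units requested by each vehicle type

# Candidate groups of one or two vehicles
combos <- list("car", "mcycle",
               c("car", "car"), c("car", "mcycle"), c("mcycle", "mcycle"))
labels <- sapply(combos, paste, collapse = "+")

# A group fits if the units it requests do not exceed the available room
in_server <- sapply(combos, function(v) sum(amount[v]) <= capacity)
in_queue  <- sapply(combos, function(v) sum(amount[v]) <= queue_size)

data.frame(combo = labels, fits_server = in_server, fits_queue = in_queue)
```

Only a single car or a single motorcycle fits in the server, and only a single motorcycle fits in the queue, which matches the three possibilities listed above.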

We are still wasting time in the branch decision. We can combine this solution with option.1 to gain extra performance:

option.5 <- function(t) {
  car <- create_trajectory() %>%
    seize("pump", amount=3) %>%
    timeout(function() rexp(1, mu[1])) %>%
    release("pump", amount=3)

  mcycle <- create_trajectory() %>%
    seize("pump", amount=2) %>%
    timeout(function() rexp(1, mu[2])) %>%
    release("pump", amount=2)

  simmer() %>%
    add_resource("pump", capacity=3, queue_size=2) %>%
    add_generator("car", car, function() rexp(1, p*lambda)) %>%
    add_generator("mcycle", mcycle, function() rexp(1, (1-p)*lambda)) %>%
    run(until=t)
}

Options 1, 2 and 3 are slower, but they give us the correct numbers, because the parameters (capacity, queue size, amounts) in the model remain unchanged compared to the problem. For instance,

gas.station <- option.1(5000)

library(ggplot2)

# Evolution + theoretical value
graph <- plot_resource_usage(gas.station, "pump", items="system")
graph + geom_hline(yintercept=N_average_theor)

However, it is not the case in options 4 and 5. The parameters of these models have been adulterated to fit our performance purposes. Therefore, we need to extract the RAW data, rescale the numbers and plot them. And, of course, we get the same figure:

gas.station <- option.5(5000)

limits <- data.frame(item = c("queue", "server", "system"), value = c(1, 1, 2))

library(dplyr); library(tidyr)

graph <- gas.station %>% get_mon_resources() %>%
  gather(item, value, server, queue, system) %>%
  mutate(value = round(value * 2/5),                  # rescaling here <------
         item = factor(item)) %>%
  filter(item %in% "system") %>%
  group_by(resource, replication, item) %>%
  mutate(mean = c(0, cumsum(head(value, -1) * diff(time))) / time) %>%
  ungroup() %>%
  ggplot() + aes(x=time, color=item) +
  geom_line(aes(y=mean, group=interaction(replication, item))) +
  ggtitle("Resource usage: pump") +
  ylab("in use") + xlab("time") + expand_limits(y=0) +
  geom_hline(aes(yintercept=value, color=item), limits, lty=2)
graph + geom_hline(yintercept=N_average_theor)
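To see why `round(value * 2/5)` recovers vehicle counts, it may help to apply it to every amount the "system" column can take under options 4 and 5. The sketch below uses base R only. The vector `raw` and the toy `time`/`value` series are illustrative assumptions, not monitoring output from simmer.

```r
# Units in the system for each reachable state (car = 3 units, motorcycle = 2)
raw <- c(empty = 0, mcycle = 2, car = 3, "mcycle+mcycle" = 4, "car+mcycle" = 5)
vehicles <- round(raw * 2/5)   # maps 0, 2, 3, 4, 5 to 0, 1, 1, 2, 2
vehicles

# The running time-average used in the plot, on a toy series:
# the value recorded at time[i] is weighted by how long it lasts
time  <- c(1, 2, 4)
value <- c(1, 2, 0)
running_mean <- c(0, cumsum(head(value, -1) * diff(time))) / time
running_mean
```

The factor 2/5 works here because rounding sends 0.8 and 1.2 to one vehicle and 1.6 and 2.0 to two. A different choice of amounts would need its own mapping.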

Finally, these are some performance results:

library(microbenchmark)

t <- 1000/lambda
tm <- microbenchmark(option.1(t),
                     option.2(t),
                     option.3(t),
                     option.4(t),
                     option.5(t))
graph <- autoplot(tm)
graph + scale_y_log10(breaks=function(limits) pretty(limits, 5)) +
  ylab("Time [milliseconds]")

To leave a comment for the author, please follow the link and comment on their blog: FishyOperations. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

Candlestick charts using Plotly and Quantmod

Mon, 2016-04-25 00:57

(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

This post is dedicated to creating candlestick charts using Plotly’s R-API.

For more information on candlestick charts visit www.stockcharts.com.

We’ll also showcase Plotly’s awesome new range selector feature!

plotlyCandleStick <- function(symbol = "MSFT",
                              fillcolor = "#ff6666",
                              hollowcolor = "#39ac73",
                              linewidth = 4,
                              plotcolor = "#3E3E3E",
                              papercolor = "#1E2022",
                              fontcolor = "#B3A78C",
                              startdate = "2015-01-01"){

  # Get OHLC prices using quantmod
  prices <- getSymbols(symbol, auto.assign = FALSE)
  prices <- prices[index(prices) >= startdate]

  # Convert to dataframe
  prices <- data.frame(time = index(prices),
                       open = as.numeric(prices[,1]),
                       high = as.numeric(prices[,2]),
                       low = as.numeric(prices[,3]),
                       close = as.numeric(prices[,4]),
                       volume = as.numeric(prices[,5]))

  # Create line segments for high and low prices
  plot.base <- data.frame()
  plot.hollow <- data.frame()
  plot.filled <- data.frame()

  for(i in 1:nrow(prices)){
    x <- prices[i, ]

    # For high / low
    mat <- rbind(c(x[1], x[3]),
                 c(x[1], x[4]),
                 c(NA, NA))

    plot.base <- rbind(plot.base, mat)

    # For open / close
    if(x[2] > x[5]){
      mat <- rbind(c(x[1], x[2]),
                   c(x[1], x[5]),
                   c(NA, NA))

      plot.filled <- rbind(plot.filled, mat)
    }else{
      mat <- rbind(c(x[1], x[2]),
                   c(x[1], x[5]),
                   c(NA, NA))

      plot.hollow <- rbind(plot.hollow, mat)
    }
  }

  colnames(plot.base) <- colnames(plot.hollow) <- colnames(plot.filled) <- c("x", "y")
  plot.base$x <- as.Date(as.numeric(plot.base$x))
  plot.hollow$x <- as.Date(as.numeric(plot.hollow$x))
  plot.filled$x <- as.Date(as.numeric(plot.filled$x))

  hovertxt <- paste("Date: ", round(prices$time,2), "<br>",
                    "High: ", round(prices$high,2),"<br>",
                    "Low: ", round(prices$low,2),"<br>",
                    "Open: ", round(prices$open,2),"<br>",
                    "Close: ", round(prices$close,2))

  # Base plot for High / Low prices
  p <- plot_ly(plot.base, x = x, y = y, mode = "lines",
               marker = list(color = '#9b9797'),
               line = list(width = 1),
               showlegend = FALSE,
               hoverinfo = "none")

  # Trace for when open price > close price
  p <- add_trace(p, data = plot.filled, x = x, y = y, mode = "lines",
                 marker = list(color = fillcolor),
                 line = list(width = linewidth),
                 showlegend = FALSE,
                 hoverinfo = "none")

  # Trace for when open price < close price
  p <- add_trace(p, data = plot.hollow, x = x, y = y, mode = "lines",
                 marker = list(color = hollowcolor),
                 line = list(width = linewidth),
                 showlegend = FALSE,
                 hoverinfo = "none")

  # Trace for volume
  p <- add_trace(p, data = prices, x = time, y = volume/1e6, type = "bar",
                 marker = list(color = "#ff9933"),
                 showlegend = FALSE,
                 hoverinfo = "x+y",
                 yaxis = "y2")

  # Trace for hover info
  p <- add_trace(p, data = prices, x = time, y = high, opacity = 0,
                 hoverinfo = "text",
                 text = hovertxt, showlegend = FALSE)

  # Layout options
  p <- layout(p,
              xaxis = list(title = "", showgrid = FALSE,
                           tickformat = "%b-%Y",
                           tickfont = list(color = fontcolor),
                           rangeselector = list(
                             x = 0.85, y = 0.97,
                             bgcolor = fontcolor,  # the variable, not the literal string "fontcolor"
                             buttons = list(
                               list(count = 3, label = "3 mo",
                                    step = "month", stepmode = "backward"),
                               list(count = 6, label = "6 mo",
                                    step = "month", stepmode = "backward"),
                               list(count = 1, label = "1 yr",
                                    step = "year", stepmode = "backward"),
                               list(count = 1, label = "YTD",
                                    step = "year", stepmode = "todate"),
                               list(step = "all")))),

              yaxis = list(title = "Price", gridcolor = "#8c8c8c",
                           tickfont = list(color = fontcolor),
                           titlefont = list(color = fontcolor),
                           domain = c(0.30, 0.95)),

              yaxis2 = list(gridcolor = "#8c8c8c",
                            tickfont = list(color = fontcolor),
                            titlefont = list(color = fontcolor),
                            side = "right",
                            domain = c(0, 0.2)),

              paper_bgcolor = papercolor,
              plot_bgcolor = plotcolor,
              margin = list(r = 50, t = 50),

              annotations = list(
                list(x = 0.02, y = 0.25, text = "Volume(mil)", ax = 0, ay = 0, align = "left",
                     xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
                     font = list(size = 20, color = fontcolor)),

                list(x = 0, y = 1, text = symbol, ax = 0, ay = 0, align = "left",
                     xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
                     font = list(size = 20, color = fontcolor)),

                list(x = 0.1, y = 1,
                     text = paste("Start: ", format(min(prices$time), "%b-%Y"),
                                  "<br>End: ", format(max(prices$time), "%b-%Y")),
                     ax = 0, ay = 0, align = "left",
                     xref = "paper", yref = "paper", xanchor = "left", yanchor = "top",
                     font = list(size = 10, color = fontcolor))
              ))

  return(p)
}

library(plotly)
library(quantmod)

plotlyCandleStick("TSLA")
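The loop in the function builds each trace as a run of two-point segments separated by NA rows, growing the data frames one row group at a time. The same layout can be produced without a loop. The sketch below is a hypothetical alternative, not the author's code: the helper `segments_df` and the three-row `prices` data frame are made up for illustration, and only base R is used.

```r
# Small made-up OHLC sample (no download needed)
prices <- data.frame(
  time  = as.Date("2016-01-04") + 0:2,
  open  = c(10, 12, 11),
  high  = c(13, 12.5, 12),
  low   = c(9.5, 10.5, 10.8),
  close = c(12, 11, 11.5)
)

# Interleave (time, from), (time, to), (NA, NA) for every row of df
segments_df <- function(df, from, to) {
  t <- as.numeric(df$time)
  x <- as.vector(rbind(t, t, NA))
  y <- as.vector(rbind(df[[from]], df[[to]], NA))
  data.frame(x = as.Date(x, origin = "1970-01-01"), y = y)
}

# Candles are "filled" when open > close, "hollow" otherwise
filled <- prices$open > prices$close

plot.base   <- segments_df(prices, "low", "high")
plot.filled <- segments_df(prices[filled, ], "open", "close")
plot.hollow <- segments_df(prices[!filled, ], "open", "close")
```

As in the loop, days where open equals close fall into the hollow group. The helper assumes that each subset has at least one row.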


To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.

Categories: Methodology Blogs

Create Amazing Looking Backtests With This One Wrong–I Mean Weird–Trick! (And Some Troubling Logical Invest Results)

Fri, 2016-04-22 16:14

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will outline an easy-to-make mistake in writing vectorized backtests–namely in using a signal obtained at the end of a period to enter (or exit) a position in that same period. The difference in results one obtains is massive.

Today, I saw two separate posts from Alpha Architect and Mike Harris, both referencing a paper by Valeriy Zakamulin on the fact that some previous trend-following research by Glabadanidis was shoddily done, and that Glabadanidis’s results were only reproducible through instituting lookahead bias.

The following code shows how to reproduce this lookahead bias.

First, the setup of a basic moving average strategy on the S&P 500 index from as far back as Yahoo data will provide.

require(quantmod)
require(xts)
require(TTR)
require(PerformanceAnalytics)

getSymbols('^GSPC', src='yahoo', from = '1900-01-01')
monthlyGSPC <- Ad(GSPC)[endpoints(GSPC, on = 'months')]

# change this line for signal lookback
movAvg <- SMA(monthlyGSPC, 10)

signal <- monthlyGSPC > movAvg
gspcRets <- Return.calculate(monthlyGSPC)

And here is how to institute the lookahead bias.

lookahead <- signal * gspcRets
correct <- lag(signal) * gspcRets

These are the “results”:

compare <- na.omit(cbind(gspcRets, lookahead, correct))
colnames(compare) <- c("S&P 500", "Lookahead", "Correct")
charts.PerformanceSummary(compare)
rbind(table.AnnualizedReturns(compare), maxDrawdown(compare), CalmarRatio(compare))
logRets <- log(cumprod(1+compare))
chart.TimeSeries(logRets, legend.loc='topleft')

Of course, this equity curve is of no use, so here’s one in log scale.

As can be seen, lookahead bias makes a massive difference.

Here are the numerical results:

                            S&P 500  Lookahead   Correct
Annualized Return         0.0740000 0.15550000 0.0695000
Annualized Std Dev        0.1441000 0.09800000 0.1050000
Annualized Sharpe (Rf=0%) 0.5133000 1.58670000 0.6623000
Worst Drawdown            0.5255586 0.08729914 0.2699789
Calmar Ratio              0.1407286 1.78119192 0.2575219

Again, absolutely ridiculous.
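For readers unfamiliar with the last rows of that table, the ratios can be approximately recomputed from the rows above them. The sketch below assumes the usual definitions (Sharpe as annualized return over annualized standard deviation with a zero risk-free rate, Calmar as annualized return over worst drawdown). The data frame `metrics` simply retypes the figures printed above, so small differences from rounding are expected.

```r
metrics <- data.frame(
  row.names  = c("S&P 500", "Lookahead", "Correct"),
  ann_return = c(0.0740, 0.1555, 0.0695),
  ann_sd     = c(0.1441, 0.0980, 0.1050),
  worst_dd   = c(0.5255586, 0.08729914, 0.2699789)
)

# Ratios derived from the rounded inputs
metrics$sharpe <- metrics$ann_return / metrics$ann_sd
metrics$calmar <- metrics$ann_return / metrics$worst_dd

round(metrics, 4)
```

The lookahead column stands out on both ratios: a higher return combined with lower volatility and a much smaller drawdown.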

Note that when using Return.portfolio (the function in PerformanceAnalytics), that package will automatically give you the next period’s return, instead of the current one, for your weights. For those writing “simple” backtests that can be quickly done using vectorized operations, however, an off-by-one error can make all the difference between a backtest in the realm of reasonable and pure nonsense. And should one wish to test for said nonsense when faced with impossible-to-replicate results, the mechanics demonstrated above are the way to do it.
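The off-by-one mechanics can also be seen on a toy series without downloading any data. The sketch below is illustrative only: the five prices are made up, the moving average uses a two-period window instead of ten, and plain vectors stand in for xts objects, so the lag is written by hand.

```r
# Made-up prices that alternate +10% / -10%
prices <- c(100, 110, 99, 108.9, 98.01)
n <- length(prices)

rets <- c(NA, prices[-1] / prices[-n] - 1)                        # period returns
sma2 <- as.numeric(stats::filter(prices, rep(1/2, 2), sides = 1)) # trailing 2-period mean
signal <- prices > sma2                                           # known only at period end

lookahead <- signal * rets               # uses the signal in the period that produced it
correct   <- c(NA, signal[-n]) * rets    # uses the previous period's signal

# Growth of 1 unit, ignoring the warm-up NAs
growth <- function(r) prod(1 + r, na.rm = TRUE)
c(lookahead = growth(lookahead), correct = growth(correct))
```

On this deliberately mean-reverting series the lookahead version collects every up move and the lagged version collects every down move. Real data is less extreme, but the direction of the bias is the same.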

Now, onto other news: I’d like to thank Gerald M for staying on top of one of the Logical Invest strategies–namely, their simple global market rotation strategy outlined in an article from an earlier blog post.

Up until March 2015 (the date of the blog post), the strategy had performed well. However, after said date?

It has been a complete disaster, which, in hindsight, was evident when I passed it through the hypothesis-driven development framework process I wrote about earlier.

So, while there has been a great deal written about not simply throwing away a strategy because of short-term underperformance, and that anomalies such as momentum and value exist because of career risk due to said short-term underperformance, it’s never a good thing when a strategy creates historically large losses, particularly after being published in such a humble corner of the quantitative financial world.

In any case, this was a post demonstrating some mechanics, and an update on a strategy I blogged about not too long ago.

Thanks for reading.

NOTE: I am always interested in hearing about new opportunities which may benefit from my expertise, and am always happy to network. You can find my LinkedIn profile here.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

Categories: Methodology Blogs

R Courses at Newcastle

Fri, 2016-04-22 15:09

(This article was first published on R – Why?, and kindly contributed to R-bloggers)

Over the next two months I’m running a number of R courses at Newcastle University.

  • May 2016
    • May 10th, 11th: Predictive Analytics
    • May 16th – 20th: Bioconductor
    • May 23rd, 24th: Advanced programming
  • June 2016
    • June 8th: R for Big Data
    • June 9th: Interactive graphics with Shiny

Since these courses are on advanced topics, numbers are limited (there’s only a couple of places left on Predictive Analytics). If you are interested in attending, sign up as soon as possible.

Getting to Newcastle is easy. The airport is 10 minutes from the city centre and has direct flights to the main airport hubs: Schiphol, Heathrow, and Paris. The courses at Newcastle attract participants from around the world; at the April course, we had representatives from North America, Sweden, Germany, Romania and Geneva.

Cost: The courses cost around £130 per day (more than half the price of certain London courses!)

 

Onsite courses available on request.

To leave a comment for the author, please follow the link and comment on their blog: R – Why?.

Categories: Methodology Blogs

New: Spanish and French Translations of Introduction to R

Thu, 2016-04-21 21:49

(This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

The team here at DataCamp is thrilled to announce that we now offer free Spanish and French translations of our most popular course, Introduction to R. Best of all, the courses are free as a part of our open course offering! By using in-browser coding challenges you will experiment with the different aspects of the R language in real time, and you will receive instant and personalized feedback that guides you to the solution. All of this, now available in Introducción a R and Introduction à R!

What you’ll learn

This free introduction to R tutorial will help you master the basics of R. In six sections, you will cover its basic syntax, preparing you to undertake your own first data analysis using R. Starting from variables and basic operations, you will learn how to handle data structures such as vectors, matrices, lists and data frames. No prior knowledge in programming or data science is required. In general, the focus is on actively understanding how to code your way through interesting data science tasks.

Create your own course

Want to create your own translation of Introduction to R? With DataCamp Teach, you can easily create and host your own interactive courses for free. Use the same system DataCamp course creators use to develop their courses, and share your R knowledge with the rest of the world. With DataCamp Teach you just write your interactive exercises in simple markdown files, and DataCamp Teach uploads the content to DataCamp for you. This makes creating a DataCamp course hassle-free.

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Blog.

Categories: Methodology Blogs

WrightMap Tutorial 4 – More Flexibility Using the person and item side…

Thu, 2016-04-21 19:57

(This article was first published on R Snippets for IRT, and kindly contributed to R-bloggers)

WrightMap Tutorial 4 – More Flexibility Using the person and item side functions

Introduction

Version 1.2 of the WrightMap package allows you to directly access the functions used for drawing the person and item sides of the map in order to allow more flexible item person maps. The parts can be put together on the same plot using the split.screen function.

Calling the functions

Let’s start by installing the latest version of the package from CRAN.

install.packages('WrightMap')
library(WrightMap)

And set up some item data.

items.loc <- sort( rnorm( 20))
thresholds <- data.frame(
  l1 = items.loc - 0.5 ,
  l2 = items.loc - 0.25,
  l3 = items.loc + 0.25,
  l4 = items.loc + 0.5)

We can draw a simple item map by calling one of the item side functions. Currently there are three: itemModern, itemClassic, and itemHist.

The itemModern function is the default called by wrightMap.

itemModern(thresholds)

The itemClassic function creates item sides inspired by text-based Wright Maps.

itemClassic(thresholds)

Finally, the itemHist function plots the items as a histogram.

itemHist(thresholds)

Similarly, the person side functions allow you to graph the person parameters. There are two, personHist and personDens.

## Mock results
multi.proficiency <- data.frame(
  d1 = rnorm(1000, mean = -0.5, sd = .5),
  d2 = rnorm(1000, mean = 0.0, sd = 1),
  d3 = rnorm(1000, mean = +0.5, sd = 1),
  d4 = rnorm(1000, mean = 0.0, sd = .5),
  d5 = rnorm(1000, mean = -0.5, sd = .75))

personHist(multi.proficiency)

personDens(multi.proficiency)

To use these plots in a Wright Map, use the item.side and person.side parameters.

wrightMap(multi.proficiency,thresholds,item.side = itemClassic,person.side = personDens)

Use with CQmodel: The personData and itemData functions

The person side and item side functions expect data in the form of matrices. They do not recognize CQmodel objects. When a CQmodel object is sent to wrightMap, it first extracts the necessary data, and then sends the data to the plotting functions. In 1.2, the data processing functions have also been made directly accessible to users in the form of the personData and itemData functions. These are fast ways to pull the data out of a CQmodel object in such a way that it is ready to be sent to wrightMap or any of the item and person plotting functions.

The personData function is very simple. It can take either a CQmodel object or a string containing the name of a ConQuest person parameter file. It extracts the person estimates as a matrix.

fpath <- system.file("extdata", package="WrightMap")
model1 <- CQmodel(file.path(fpath,"ex7a.eap"), file.path(fpath,"ex7a.shw"))
head(model1$p.est)
##   casenum est (d1) error (d1) pop (d1) est (d2) error (d2) pop (d2)
## 1       1  1.37364    0.70308  0.60309  1.73654    0.60556  0.52928
## 2       2 -0.17097    0.64866  0.66216  0.75620    0.54852  0.61379
## 3       3  0.46677    0.64837  0.66246  0.85146    0.55129  0.60987
## 4       4  0.67448    0.66017  0.65006  1.16098    0.56368  0.59214
## 5       5  0.89717    0.67704  0.63195  1.49079    0.58539  0.56012
## 6       6  1.64704    0.72529  0.57762  2.11784    0.62916  0.49188
m1.person <- personData(model1)
head(m1.person)
##         d1      d2
## 1  1.37364 1.73654
## 2 -0.17097 0.75620
## 3  0.46677 0.85146
## 4  0.67448 1.16098
## 5  0.89717 1.49079
## 6  1.64704 2.11784
personHist(m1.person,dim.lab.side = 1)

The itemData function uses the GIN table (Thurstonian thresholds) if it is there, and otherwise tries to create delta parameters out of the RMP tables. You can also specify tables to use as items, steps, and interactions, and it will add them together appropriately to create delta parameters.

model2 <- CQmodel(file.path(fpath,"ex4a.mle"), file.path(fpath,"ex4a.shw"))
names(model2$RMP)
## [1] "rater"                     "topic"
## [3] "criteria"                  "rater*topic"
## [5] "rater*criteria"            "topic*criteria"
## [7] "rater*topic*criteria*step"
m2.item <- itemData(model2,item.table = "topic", interactions = "rater*topic",
                    step.table = "rater")
itemModern(m2.item)

See Tutorial 3 for details on specifying tables from CQmodel objects.

Having these data functions pulled out also makes it easier to combine parameters from different models onto a single plot (when appropriate).

wrightMap(m1.person,m2.item)

Putting it all together with split.screen

By calling these functions directly and using split.screen, we can make Wright Maps with other arrangements of persons and items. The item side functions can be combined using any of the base graphics options for combining plots (layout, par(mfrow)), but the person side functions are based on split.screen, which is incompatible with those options. We will be combining item and person maps, so we need to use split.screen.

The first step of combining these functions is to set up the screens. Details for screen functions are in the documentation for split.screen. The function takes as a parameter a 4-column matrix, in which each row is a screen, and the columns represent the left, right, bottom, and top of the screens respectively. Each value is expressed as a number from 0 to 1, where 0 is the left/bottom of the current device and 1 is the right/top.
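To make the matrix layout concrete, here is a small base-graphics sketch that does not need WrightMap. It assumes the column order documented for split.screen (left, right, bottom, top) and splits the device into a top half and a bottom half. The `pdf(NULL)` call, the `figs` name and the two toy plots are illustrative choices so that the example runs without opening a window.

```r
pdf(NULL)   # null device: nothing is written to disk

# One row per screen: left, right, bottom, top (fractions of the device)
figs <- matrix(c(0, 1, 0.5, 1,     # screen 1: full width, top half
                 0, 1, 0,   0.5),  # screen 2: full width, bottom half
               ncol = 4, byrow = TRUE)

screens <- split.screen(figs = figs)

screen(1)
par(mar = c(2, 2, 2, 1))
plot(1:10, main = "Screen 1 (top)")

screen(2)
par(mar = c(2, 2, 2, 1))
plot(10:1, main = "Screen 2 (bottom)")

# Release the screens so later plots are not affected
close.screen(all.screens = TRUE)
invisible(dev.off())
```

Swapping the second and third entries of a row is an easy mistake to make, and it can produce a screen with zero or negative size.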

To make a Wright Map with the items on the left and the persons on the right, we will set up two screens, with 80% of the width on the left and 20% on the right.

split.screen(figs = matrix(c(0,.8,0,1,
                             .8,1,0,1), ncol = 4, byrow = TRUE))

Next, we’ll draw the item side. IMPORTANT NOTE: Make sure to explicitly set the yRange variable when combining plots to ensure they are on the same scale. We can also adjust some of the other parameters to work better with a left-side item plot. We’ll move the logit axis to the left with the show.axis.logits parameter, and set the righthand outer margin to 2 to give us a space between the plots.

itemModern(thresholds, yRange = c(-3,4), show.axis.logits = "L", oma = c(0,0,0,2))

We can also add a title at this time.

mtext("Wright Map", side = 3, font = 2, line = 1)

Finally, we will move to screen 2 and draw the person side. This plot will be adjusted to move the persons label and remove the axis.

screen(2)
personHist(multi.proficiency, axis.persons = "", yRange = c(-3,4),
           axis.logits = "Persons", show.axis.logits = FALSE)

The last thing to do is to close all the screens to prevent them from getting in the way of any future plotting.

close.screen(all.screens = TRUE)

Here is the complete plot:

split.screen(figs = matrix(c(0,.8,0,1,
                             .8,1,0,1), ncol = 4, byrow = TRUE))
itemModern(thresholds, yRange = c(-3,4), show.axis.logits = "L", oma = c(0,0,0,2))
mtext("Wright Map", side = 3, font = 2, line = 1)
screen(2)
personHist(multi.proficiency, axis.persons = "", yRange = c(-3,4),
           axis.logits = "Persons", show.axis.logits = FALSE)
close.screen(all.screens = TRUE)

Countless arrangements are possible. As one last example, here are two ways to put two dimensions side by side in separate Wright Maps.

Explicitly splitting the device into four screens:

d1 = rnorm(1000, mean = -0.5, sd = 1)
d2 = rnorm(1000, mean = 0.0, sd = 1)

dim1.diff <- rnorm(5)
dim2.diff <- rnorm(5)

split.screen(figs = matrix(c(0,.09,0,1,
                             .11,.58,0,1,
                             .5,.59,0,1,
                             .51,1,0,1), ncol = 4, byrow = TRUE))

personDens(d1, yRange = c(-3,3), show.axis.logits = FALSE,
           axis.logits = "")
screen(2)
itemModern(dim1.diff, yRange = c(-3,3), show.axis.logits = FALSE)
mtext("Wright Map", side = 3, font = 2, line = 1)
screen(3)
personDens(d2, yRange = c(-3,3), show.axis.logits = FALSE,
           axis.logits = "", axis.persons = "", dim.names = "Dim2")
screen(4)
itemModern(dim2.diff, yRange = c(-3,3), show.axis.logits = FALSE,
           label.items = paste("Item",6:10))

close.screen(all.screens = TRUE)

Splitting the device into two screens with a Wright Map on each:

split.screen(figs = matrix(c(0,.5,0,1,
                             .5,1,0,1), ncol = 4, byrow = TRUE))
wrightMap(d1, dim1.diff, person.side = personDens, show.axis.logits = FALSE)
screen(2)
wrightMap(d2, dim2.diff, person.side = personDens, show.axis.logits = FALSE)
close.screen(all.screens = TRUE)

To leave a comment for the author, please follow the link and comment on their blog: R Snippets for IRT.

Categories: Methodology Blogs

Introducing fidlr: FInancial Data LoadeR

Thu, 2016-04-21 16:08

(This article was first published on R – The R Trader, and kindly contributed to R-bloggers)

fidlr is an RStudio addin designed to simplify the financial data downloading process from various providers. This initial version is a wrapper around the getSymbols function in the quantmod package and only Yahoo, Google, FRED and Oanda are supported. I will probably add functionality over time. As usual with those things just a kind reminder: “THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND…”

How to install and use fidlr?

  1. You can get the addin/package from its Github repository here (I will register it on CRAN later on)
  2. Install the addin. There is an excellent tutorial to install RStudio Addins here.
  3. Once the addin is installed it should appear in the Addin menu. Just choose fidlr in the menu and a window as pictured below should appear.
  4. Choose a data provider from the Source dropdown menu.
  5. Select a date range from the Date menu
  6. Enter the symbol you wish to download in the instrument text box. To download several symbols just enter the symbols separated by commas.
  7. Use the Radio buttons to choose whether you want to download the instrument in a csv file or in the global environment. The csv file will be saved in the working directory and there will be one csv file per instrument.
  8. Press Run to get the data or Close to close down the addin

Error messages and warnings are handled by the underlying packages (quantmod and Shiny) and can be read from the console.
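
For readers who prefer the console, the steps above can be roughly sketched in plain R. This is not fidlr's own code: parse_symbols and fetch_instruments are hypothetical helpers, the default date range is an assumption, and the from/to arguments are only meaningful for sources that accept them (such as Yahoo). The download itself needs the quantmod and zoo packages and a network connection.

```r
# Hypothetical helper: turn "AAPL, MSFT ,GOOG" into a clean character vector
parse_symbols <- function(txt) {
  syms <- trimws(strsplit(txt, ",", fixed = TRUE)[[1]])
  syms[nzchar(syms)]
}

# Hypothetical helper: fetch each symbol, then either write one csv per
# instrument to the working directory or assign it in the global environment
fetch_instruments <- function(txt, src = "yahoo",
                              from = Sys.Date() - 365, to = Sys.Date(),
                              to_csv = FALSE) {
  if (!requireNamespace("quantmod", quietly = TRUE))
    stop("quantmod is required for this sketch")
  syms <- parse_symbols(txt)
  for (sym in syms) {
    dat <- quantmod::getSymbols(sym, src = src, from = from, to = to,
                                auto.assign = FALSE)
    if (to_csv) {
      zoo::write.zoo(dat, file = paste0(sym, ".csv"), sep = ",")
    } else {
      assign(sym, dat, envir = globalenv())
    }
  }
  invisible(syms)
}

# Only the parsing step is run here; the download needs network access
parse_symbols("AAPL, MSFT ,GOOG")
```

A call such as fetch_instruments("AAPL,MSFT", to_csv = TRUE) would then mirror the csv option of the addin, under the assumptions stated above.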

This is a very first version of the project so do not expect perfection but hopefully it will get better over time. Please report any comment, suggestion, bug etc… to: thertrader@gmail.com

Enjoy!

To leave a comment for the author, please follow the link and comment on their blog: R – The R Trader.

Categories: Methodology Blogs

Principal curves example (Elements of Statistical Learning)

Thu, 2016-04-21 14:01

(This article was first published on R – BioStatMatt, and kindly contributed to R-bloggers)

The bit of R code below illustrates the principal curves methods as described in The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman (Ch. 14; the book is freely available from the authors’ website). Specifically, the code generates some bivariate data that have a nonlinear association, initializes the principal curve using the first (linear) principal component, and then computes three iterations of the algorithm described in section 14.5.2. I used the ‘animation’ package to generate the following animated GIF, which illustrates these steps.

## generate some bivariate data
set.seed(42)
x1 <- seq(1,10,0.3)
w = .6067; a0 = 1.6345; a1 = -.6235; b1 = -1.3501; a2 = -1.1622; b2 = -.9443;
x2 = a0 + a1*cos(x1*w) + b1*sin(x1*w) + a2*cos(2*x1*w) + b2*sin(2*x1*w) + rnorm(length(x1),0,3/4)
x <- scale(cbind(x1,x2))
alim <- extendrange(x, f=0.1)
alim_ <- range(x)

## plot centered data
plot(x[,1], x[,2], bty='n',
     xlab=expression(x[1]),
     ylab=expression(x[2]),
     xlim=alim, ylim=alim)
legend("topleft", legend=c("Initialize"), bty="n")

## plot first principal component line
svdx <- svd(x)
clip(alim_[1],alim_[2],alim_[1],alim_[2])
with(svdx, abline(a=0, b=v[2,1]/v[1,1]))

## plot projections of each point onto line
z1 <- with(svdx, x%*%v[,1]%*%t(v[,1]))
segments(x0=x[,1],y0=x[,2], x1=z1[,1],y1=z1[,2])

## compute initial lambda (arc-lengths associated with
## orthogonal projections of data onto curve)
lam <- with(svdx, as.numeric(u[,1]*d[1]))

for(itr in 1:3) {

  #### step (a) of iterative algorithm ####

  ## compute scatterplot smoother in either dimension
  ## increase 'df' to make the curve more flexible
  fit1 <- smooth.spline(x=lam, y=x[,1], df=4)
  fit2 <- smooth.spline(x=lam, y=x[,2], df=4)

  ## plot data and the principal curve for a sequence of lambdas
  plot(x[,1], x[,2], bty='n',
       xlab=expression(x[1]),
       ylab=expression(x[2]),
       xlim=alim, ylim=alim)
  legend("topleft", legend=c("Step (a)"), bty="n")
  seq_lam <- seq(min(lam),max(lam),length.out=100)
  lines(predict(fit1, seq_lam)$y, predict(fit2, seq_lam)$y)

  ## show points along curve corresponding
  ## to original lambdas
  z1 <- cbind(predict(fit1, lam)$y, predict(fit2, lam)$y)
  segments(x0=x[,1],y0=x[,2], x1=z1[,1],y1=z1[,2])

  #### step (b) of iterative algorithm ####

  ## recompute lambdas
  euc_dist <- function(l, x, f1, f2)
    sum((c(predict(f1, l)$y, predict(f2, l)$y) - x)^2)
  lam <- apply(x,1,function(x0) optimize(euc_dist,
    interval=extendrange(lam, f=0.50),
    x=x0, f1=fit1, f2=fit2)$minimum)

  ## show projections associated with recomputed lambdas
  plot(x[,1], x[,2], bty='n',
       xlab=expression(x[1]),
       ylab=expression(x[2]),
       xlim=alim, ylim=alim)
  legend("topleft", legend=c("Step (b)"), bty="n")
  seq_lam <- seq(min(lam),max(lam),length.out=100)
  lines(predict(fit1, seq_lam)$y, predict(fit2, seq_lam)$y)
  z1 <- cbind(predict(fit1, lam)$y, predict(fit2, lam)$y)
  segments(x0=x[,1],y0=x[,2], x1=z1[,1],y1=z1[,2])
}
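
The initialization step may look opaque, so here is a small self-contained check of why u[,1]*d[1] can serve as the initial lambda. Since x = U D V', the product x v1 equals u1 d1, so the initial lambda is simply the signed coordinate of each point along the first principal direction. The data below are placeholders, not the data from the post.

```r
set.seed(42)
# Small centred and scaled bivariate sample (placeholder data)
x <- scale(cbind(x1 = rnorm(20), x2 = rnorm(20)))
s <- svd(x)
v1 <- s$v[, 1]

# Initial lambda as computed in the post ...
lam_svd <- s$u[, 1] * s$d[1]
# ... equals the signed coordinate of each point along v1
lam_proj <- as.numeric(x %*% v1)

# Projected points: either project x onto v1 directly,
# or multiply each lambda by the direction vector
z_line <- x %*% v1 %*% t(v1)
z_lam  <- tcrossprod(lam_svd, v1)

all.equal(lam_svd, lam_proj)
all.equal(z_line, z_lam, check.attributes = FALSE)
```

Both comparisons should agree up to numerical tolerance, which is why the later iterations can treat lam as a one-dimensional coordinate for the smoothing splines.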

To leave a comment for the author, please follow the link and comment on their blog: R – BioStatMatt.

Categories: Methodology Blogs

Get ready for R/Finance 2016

Thu, 2016-04-21 11:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

R/Finance 2016 is less than a month away and, as always, I am very much looking forward to it. In past years, I have elaborated on what puts it among my favorite conferences even though I am not a finance guy. R/Finance is small, single track and intense with almost no fluff. And scattered among the esoterica of finance and trading there has, so far, always been a rich mix of mathematics, time series applications, R programming, stimulating conversation and attitude. When it comes down to it, it’s the people, the organizers and participants who make a conference. Looking over the agenda for this year, I am sure that once again, for two days at least, Chicago will be the center of the R world.

This year, however, I am going to be ready for R/Finance. I am going to do my homework. If I had done a little prep last year I would have had a copy of Arthur Koestler’s The Sleepwalkers in my bag. So when Emanuel Derman went deep philosophy I could have gone through that looking-glass with him.

So what’s on the lineup this year? Rishi Narang will lead off for the keynote speakers with a talk provocatively entitled “Rage Against the Machine Learning”. There is not much online for an industry outsider to latch onto, but it probably wouldn’t hurt to have a look at one of his three books on quantitative trading.

Tarek Eldin will deliver the second keynote, entitled “Random Pricing Errors and Systematic Returns: The Flaw in Fundamental Prices”. My guess is that this online paper might provide some relevant preparatory reading.

Frank Diebold’s keynote is entitled “Estimating Global Bank Network Connectedness”. I think it’s a safe bet that his recent paper with Mert Demirer, Laura Liu and Kami Yilmaz will indeed be relevant.

Batting cleanup for the keynote speakers will be none other than the R Inferno himself, who vaguely and possibly misleadingly suggests that preparation for his talk, “Some Linguistics of Quantitative Finance” might begin with Yucatan.

Satellite image of Península de Yucatán, México. Credit: Nasa

For preparation on more solid ground, I am going to look into the R packages explicitly called out in the agenda. Of course, there will be Rcpp. Chicago is Eddelbuettel country and no doubt much of the conversation over coffee will revolve around high performance computing. But, even R users who are not particularly interested in writing high performance code themselves ought to know something about this package. With a reverse table listing hundreds of packages it is becoming the foundation for much of R.

In addition to Dirk’s tutorial on Rcpp and RcppArmadillo, Matt Dziubinski will talk about getting the most out of Rcpp in practice and Jason Foster will talk about using RcppParallel for multi-asset principal component regression. Look here for some older talks by Matt.

Robert McDonald will describe the derivmkts package which contains functions that support his book Derivatives Markets.

Eran Raviv will talk about combining multiple forecasts using R’s  ForecastCombinations package.

Kjell Konis will describe how to compare Fitted Factor Models with his fit.models package.

Steven Pav will speak of madness, a package for multivariate automatic differentiation. There is a very nice vignette that describes the mathematics of madness.

Qiang Kou will talk about deep learning in R using the MxNet package which makes use of GPUs.

Mario Annau will talk about the h5 package, an S4 interface to the HDF5 storage format.

Robert Krzyzanowski will describe the Syberia development framework for R.

Dirk Eddelbuettel will revisit the Rblpapi package for connecting R to Bloomberg.

Michael Kane will talk about a new package he is writing, glmnetlib, which is intended to be a low-level library for regularized regression.

Matt Brigida will use a Shiny implementation to talk about Community Finance Teaching Resources.

When I registered for the conference I saw that the preconference tutorial by Harte and Weylandt on modern Bayesian tools for time series analysis is going to use STAN. So, I need to add rstan to the list.

In his tutorial on leveraging the Azure cloud from R, Doug Service will show how to use the foreach package in the Azure environment.

And then, for some serious preparation it might be helpful to take a look at the math underlying some of the presentations. For example, Klaus Spanderen will talk about calibrating Heston Local Stochastic Volatility Models. Sida Yang will discuss using Latent Dirichlet Allocation to discover distributions underlying financial news topics and Pedro Alexander will discuss portfolio selection with support vector regression.

All that I have listed won't even cover half of what will be presented at the conference. However, I hope some of it will be helpful in preparing for R/Finance. But, most importantly, don’t forget to register! Unfortunately, this year many, if not most, of the people who would like to go to the useR! conference will not be able to attend. Don’t get locked out of R/Finance too!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

an integer programming riddle

Wed, 2016-04-20 18:16

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A puzzle on The Riddler this week ends up as a standard integer programming problem. Removing the little story around the question, it boils down to maximising

200a+100b+50c+25d

under the constraints

400a+400b+150c+50d≤1000, b≤a, a≤1, c≤8, d≤4,

and (a,b,c,d) all non-negative integers. My first attempt was a brute force R code since there are only 3 x 9 x 5 = 135 cases:

f.obj<-c(200,100,50,25)
f.con<-matrix(c(40,40,15,5,
                -1,1,0,0,
                1,0,0,0,
                0,0,1,0,
                0,0,0,1),ncol=4,byrow=TRUE)
f.dir<-c("<=","<=","<=","<=","<=")
# right-hand sides: budget (divided by 10), b<=a, a<=1, c<=8, d<=4
f.rhs<-c(100,0,1,8,4)

sol=0
for (a in 0:1)
for (b in 0:a)
for (k in 0:8)
for (d in 0:4){
  # a combination is feasible when no constraint is exceeded
  cost=f.con%*%c(a,b,k,d)-f.rhs
  if (max(cost)<=0){
    gain=f.obj%*%c(a,b,k,d)
    if (gain>sol){
      sol=gain
      argu=c(a,b,k,d)}}}

which returns the value:

> sol
     [,1]
[1,]  425
> argu
[1] 1 0 3 3

This is confirmed by a call to an integer programming code like lpSolve:

> lp("max",f.obj,f.con,f.dir,f.rhs,all.int=TRUE)
Success: the objective function is 425
> lp("max",f.obj,f.con,f.dir,f.rhs,all.int=TRUE)$sol
[1] 1 0 3 3

which provides the same solution.
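
As an aside, the same exhaustive search can be written without nested loops. The sketch below is an alternative formulation rather than the code used in the post: it enumerates the candidates with expand.grid and filters them with the constraints exactly as stated in the text (variable names are illustrative).

```r
# Enumerate every candidate (a, b, c, d) within the simple bounds
grid <- expand.grid(a = 0:1, b = 0:1, c = 0:8, d = 0:4)
grid <- grid[grid$b <= grid$a, ]   # b <= a leaves 3 x 9 x 5 = 135 cases

# Keep only the combinations that satisfy the budget constraint
cost <- with(grid, 400 * a + 400 * b + 150 * c + 50 * d)
feasible <- grid[cost <= 1000, ]

# Evaluate the objective and pick the best feasible combination
feasible$gain <- with(feasible, 200 * a + 100 * b + 50 * c + 25 * d)
best <- feasible[which.max(feasible$gain), ]
best
```

With only 135 candidates the data-frame approach is instantaneous, and it makes it easy to inspect the runners-up by sorting feasible on gain.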

Filed under: Books, Kids, R Tagged: 538, cross validated, FiveThirtyEight, integer programming, The Riddler

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Categories: Methodology Blogs

Pride and Prejudice and Z-scores

Wed, 2016-04-20 14:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text of the 6 Austen novels, including Pride and Prejudice and Sense and Sensibility.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:

There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass Fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:
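
To give a feel for what low-pass Fourier smoothing does, here is a minimal base-R sketch. It is not the code from Julia's post or from syuzhet: low_pass is a hypothetical helper, the "sentiment" series is synthetic, and the number of retained frequencies is an arbitrary choice. The idea is to transform the series, keep only the few lowest frequencies, and transform back.

```r
# Hypothetical helper: keep the mean and the `keep` lowest frequencies
low_pass <- function(x, keep = 3) {
  n <- length(x)
  stopifnot(keep >= 0, keep < n / 2)
  spec <- fft(x)
  # Index 1 is the mean; frequency k sits at index k + 1 and its
  # conjugate partner at index n + 1 - k
  idx <- c(1, 1 + seq_len(keep), n + 1 - seq_len(keep))
  filtered <- rep(0+0i, n)
  filtered[idx] <- spec[idx]
  Re(fft(filtered, inverse = TRUE)) / n
}

# Synthetic stand-in for a noisy sentiment trajectory
set.seed(1)
n <- 200
pos <- seq_len(n)
sentiment <- sin(2 * pi * pos / n) + rnorm(n, sd = 0.5)
smoothed <- low_pass(sentiment, keep = 3)

plot(pos, sentiment, type = "l", col = "grey", xlab = "Position in text")
lines(pos, smoothed, lwd = 2)
```

Because every series is reduced to the same handful of slow components, the smoothed curves of texts of different lengths become directly comparable, which is what makes the side-by-side view of the novels possible.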

An apparent Austen aficionada, Julia interprets the analysis:

This is super interesting to me. Emma and Northanger Abbey have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. Mansfield Park and Persuasion also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. Persuasion also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

Installing SQL Server ODBC drivers on Ubuntu (in Travis-CI)

Wed, 2016-04-20 10:13

(This article was first published on R – It's a Locke, and kindly contributed to R-bloggers)

Did you know you can now get SQL Server ODBC drivers for Ubuntu? Yes, no, maybe? It’s ok even if you haven’t since it’s pretty new! Anyway, this presents me with an ideal opportunity to standardise my SQL Server ODBC connections across the operating systems I use R on i.e. Windows and Ubuntu. My first trial was to get it working on Travis-CI since that’s where all my training magic happens and if it can’t work on a clean build like Travis, then where can it work?

Now I can create R functionality that can reliably depend on SQL Server without having to fall back to JDBC. A definite woohoo moment!

TL;DR

It works, but it’s really hacky right now. Definitely looking forward to the next iterations of this driver. I’m also really glad I could squash all my commits when I merged the dev branch to master for this exercise: it took a while to remember I could test my commands on an Ubuntu Docker container first, and even when I tested on Docker I still had to test on Travis line by line. The final .travis.yml file is available for folks to copy & paste from.

Disclaimer
  • Each line in the travis file could be put into a generic script and used on any ubuntu system but there may be some steps missing like installing gcc that are present on the Travis infrastructure. You probably can’t take the script and expect it to work elsewhere first time though.
  • This is currently hacky, and Microsoft are on the case for improving it so this post could quickly become out of date.
  • Be very careful installing the driver on an existing machine: it overwrites unixODBC if already installed, and there are potential compatibility issues with other driver managers you may have installed.

Line by line

- wget https://download.microsoft.com/download/2/E/5/2E58F097-805C-4AB8-9FC6-71288AB4409D/msodbcsql-13.0.0.0.tar.gz -P ..

Download the compressed file containing all the relevant stuff. This URL is important – the website does not provide a URL like this and this one is likely to be unstable. Microsoft are aware of this as a problem for users who like to script everything and will hopefully be addressing it in the short to medium term.

The -P .. tells the wget command to dump the file in the parent directory so that it won’t set off warnings when I build my R package.

- tar xvzf ../msodbcsql-13.0.0.0.tar.gz -C ..

This little line extracts the archive we just downloaded into the parent directory.

- sed -i '14d' ../msodbcsql-13.0.0.0/build_dm.sh
- sed -i '/tmp=/ctmp=/tmp/odbcbuilds' ../msodbcsql-13.0.0.0/build_dm.sh

Unfortunately the default script that should be executed next generates a random directory for the unixODBC driver manager. The random directory is present in the output text and not easy to pipe into the next command. Consequently, with much help from Vin from MSFT, we have this current hack to switch to a fixed directory: the first sed command deletes line 14 of build_dm.sh, and the second replaces the line matching tmp= with tmp=/tmp/odbcbuilds.

- ../msodbcsql-13.0.0.0/build_dm.sh --accept-warning

This line runs a shell script that builds the unixODBC driver manager. Note – you can’t rely on the unixODBC driver available via apt-get at this time due to the SQL Server ODBC driver not being compatible (currently) with the latest versions. Also, it wasn’t noted in the manual but I had to add the --accept-warning to suppress some sort of notification that wanted to be triggered. I suspect I just sold my soul and that I’m encouraging you to do the same.

- cd /tmp/odbcbuilds/unixODBC-2.3.1
- sudo make install

These lines shunt us over to the directory for the unixODBC build and install it. The sudo is necessary for the installation to the usr/ directory.

- cd $TRAVIS_BUILD_DIR

This gets you back to your starting package directory for continuing on to package install.

- sudo apt-get install libgss3 -y

This dependency was needed by the ODBC driver

- ../msodbcsql-13.0.0.0/install.sh verify

Verify that the driver can be installed. This check is not as thorough as it could be: it does not confirm that you are in the right directory (a bug/feature of the installer), and if you are not, a series of file copies in the install process won’t work.

- cd ../msodbcsql-13.0.0.0/
- sudo ./install.sh install --accept-license

Proceed to install the driver from the right directory.

- odbcinst -q -d -n "ODBC Driver 13 for SQL Server"

Test the driver is usable
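
Once the driver is registered, R code can refer to it by the name queried above. The following is only a sketch of how a connection string might be assembled on the R side: sqlserver_conn_string is a hypothetical helper, the server, database and credentials are placeholders, and the RODBC calls are left as comments because they need the RODBC package and a reachable SQL Server instance.

```r
# Hypothetical helper: assemble an ODBC connection string for the driver
# registered above; server, database and credentials are placeholders
sqlserver_conn_string <- function(server, database, uid, pwd,
                                  driver = "ODBC Driver 13 for SQL Server") {
  paste0("Driver={", driver, "};",
         "Server=", server, ";",
         "Database=", database, ";",
         "Uid=", uid, ";",
         "Pwd=", pwd, ";")
}

conn_str <- sqlserver_conn_string("myserver.example.com", "mydb",
                                  "user", "secret")

# With RODBC installed and a reachable server, the string could be used as:
# con <- RODBC::odbcDriverConnect(conn_str)
# RODBC::sqlQuery(con, "SELECT 1 AS ok")
# RODBC::odbcClose(con)
```

Keeping the driver name in one place means the same code can run on Windows and Ubuntu as long as both have a driver registered under that name.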

The final file

language: r
sudo: true
warnings_are_errors: true
cache: packages

r_github_packages:
  - rich-iannone/DiagrammeR

before_install:
  - chmod 755 ./.push_gh_pages.sh
  - wget https://download.microsoft.com/download/2/E/5/2E58F097-805C-4AB8-9FC6-71288AB4409D/msodbcsql-13.0.0.0.tar.gz -P ..
  - tar xvzf ../msodbcsql-13.0.0.0.tar.gz -C ..
  - sed -i '14d' ../msodbcsql-13.0.0.0/build_dm.sh
  - sed -i '/tmp=/ctmp=/tmp/odbcbuilds' ../msodbcsql-13.0.0.0/build_dm.sh
  - ../msodbcsql-13.0.0.0/build_dm.sh --accept-warning
  - cd /tmp/odbcbuilds/unixODBC-2.3.1
  - sudo make install
  - cd $TRAVIS_BUILD_DIR
  - sudo apt-get install libgss3 -y
  - ../msodbcsql-13.0.0.0/install.sh verify
  - cd ../msodbcsql-13.0.0.0/
  - sudo ./install.sh install --accept-license
  - cd $TRAVIS_BUILD_DIR
  - odbcinst -q -d -n "ODBC Driver 13 for SQL Server"

after_success:
  - ./.push_gh_pages.sh

The manuals (for reading)

The post Installing SQL Server ODBC drivers on Ubuntu (in Travis-CI) appeared first on It's a Locke.

To leave a comment for the author, please follow the link and comment on their blog: R – It's a Locke.

Categories: Methodology Blogs

yorkr crashes the IPL party! – Part 2

Wed, 2016-04-20 10:11

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Most people say that it is the intellect which makes a great scientist. They are wrong: it is character.

Albert Einstein

Science is organized knowledge. Wisdom is organized life.

Immanuel Kant

If I have seen further, it is by standing on the shoulders of giants

Isaac Newton

Valid criticism does you a favor.

Carl Sagan

Introduction

In this post, my R package ‘yorkr’ continues to bat in the IPL Twenty20s. This post is a continuation of my earlier post, yorkr crashes the IPL party! – Part 1. It deals with Class 2 functions, namely the performances of an IPL team in all T20 matches against another IPL team, e.g. all T20 matches of Chennai Super Kings vs Royal Challengers Bangalore or Kochi Tuskers Kerala vs Mumbai Indians.

You can clone/fork the code for my package yorkr from Github at yorkr

This post has also been published at RPubs IPLT20-Part2 and can also be downloaded as a PDF document from IPLT20-Part2.pdf

The functions in Class 2 are

  1. teamBatsmenPartnershiOppnAllMatches()
  2. teamBatsmenPartnershipOppnAllMatchesChart()
  3. teamBatsmenVsBowlersOppnAllMatches()
  4. teamBattingScorecardOppnAllMatches()
  5. teamBowlingPerfOppnAllMatches()
  6. teamBowlersWicketsOppnAllMatches()
  7. teamBowlersVsBatsmenOppnAllMatches()
  8. teamBowlersWicketKindOppnAllMatches()
  9. teamBowlersWicketRunsOppnAllMatches()
  10. plotWinLossBetweenTeams()
1. Install the package from CRAN

library(yorkr)
rm(list=ls())

2. Get data for all T20 matches between 2 teams

We can get all IPL T20 matches between any 2 teams using the function below. The dir parameter should point to the folder which has the IPL T20 RData files of the individual matches. This function creates a data frame of all the IPL T20 matches and also saves the data frame as RData. The call below gets all matches between Sunrisers Hyderabad and Royal Challengers Bangalore.

setwd("C:/software/cricket-package/york-test/yorkrData/IPL/IPL-T20-matches")
matches <- getAllMatchesBetweenTeams("Sunrisers Hyderabad","Royal Challengers Bangalore",dir=".")
dim(matches)
## [1] 1320   25

I have, however, already saved the IPL Twenty20 matches for all possible combinations of opposing IPL teams. The data for these matches can be obtained from Github in the folder IPL-T20-allmatches-between-two-teams

Note: You will need to use the function below for future matches! The data in Cricsheet are from 2008 -2015

3. Save data for all matches between all combination of 2 teams

This can be done locally using the function below. You could use this function to combine all IPL Twenty20 matches between any 2 IPL teams into a single data frame and save it in the current folder. The current implementation expects the RData files of individual matches to be in the ../data folder. Since I have already converted these, I will not be running this again.

# Available in yorkr_0.0.5. Can be installed from Github though!
#saveAllMatchesBetween2IPLTeams()

4. Load data directly for all matches between 2 IPL teams

As in my earlier post, I pick all IPL Twenty20 matches between 2 random IPL teams. I load the data directly from the stored RData files. When we load an RData file, a “matches” object is created. This object can be stored under an appropriate name for the teams, as below.
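
Because every load() call overwrites the same matches object, one alternative is to load each file into a private environment and return the object from there. The sketch below is not part of yorkr: load_matches is a hypothetical helper, and the toy data frame and temporary file exist only to make the example self-contained.

```r
# Hypothetical helper: load an RData file into a private environment and
# return the object called `name`, leaving the workspace untouched
load_matches <- function(file, name = "matches") {
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env)
}

# Self-contained demonstration with a toy data frame and a temporary file
matches <- data.frame(batsman = c("A", "B"), runs = c(10, 20))
tmp <- tempfile(fileext = ".RData")
save(matches, file = tmp)
rm(matches)

toy_matches <- load_matches(tmp)
str(toy_matches)
```

With the real files, a call such as csk_dd_matches <- load_matches("Chennai Super Kings-Delhi Daredevils-allMatches.RData") would replace each load-then-rename pair, assuming the file stores its data under the name matches as described above.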

# Load T20 matches between 2 IPL teams
setwd("C:/software/cricket-package/york-test/yorkrData/IPL/IPL-T20-allmatches-between-two-teams")
load("Chennai Super Kings-Delhi Daredevils-allMatches.RData")
csk_dd_matches <- matches
load("Deccan Chargers-Kolkata Knight Riders-allMatches.RData")
dc_kkr_matches <- matches
load("Mumbai Indians-Pune Warriors-allMatches.RData")
mi_pw_matches <- matches
load("Rajasthan Royals-Sunrisers Hyderabad-allMatches.RData")
rr_sh_matches <- matches
load("Kings XI Punjab-Royal Challengers Bangalore-allMatches.RData")
kxip_rcb_matches <-matches
load("Chennai Super Kings-Kochi Tuskers Kerala-allMatches.RData")
csk_ktk_matches <-matches

5. Team Batsmen partnership in Twenty20 (all matches with opposing IPL team)

This function creates a report of the batting partnerships of an IPL team across its matches against the opposing team. The report can be brief or detailed depending on the parameter ‘report’. As can be seen, M S Dhoni tops the list for CSK, followed by Raina and then Murali Vijay, for matches against Delhi Daredevils. For the Delhi Daredevils it is V Sehwag followed by Gambhir.
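
As a rough illustration of how the ‘summary’ and ‘detailed’ reports relate, the sketch below aggregates a few rows copied from the detailed Mumbai Indians report shown further down. It uses base R only and is not the package's implementation; the object names are illustrative.

```r
# A few rows copied from the detailed Mumbai Indians report in this post
detailed <- data.frame(
  batsman = c(rep("SR Tendulkar", 7), rep("AT Rayudu", 3)),
  nonStriker = c("JEC Franklin", "AT Rayudu", "RG Sharma", "KD Karthik",
                 "RT Ponting", "AC Blizzard", "RJ Peterson",
                 "SR Tendulkar", "RG Sharma", "KD Karthik"),
  partnershipRuns = c(24, 46, 2, 20, 39, 12, 9, 54, 37, 1),
  stringsAsFactors = FALSE
)

# Total runs per batsman, sorted in decreasing order (illustrative only)
summary_tbl <- aggregate(partnershipRuns ~ batsman, data = detailed, FUN = sum)
names(summary_tbl)[2] <- "totalRuns"
summary_tbl <- summary_tbl[order(-summary_tbl$totalRuns), ]
summary_tbl
```

The totals reproduce the totalRuns column of the detailed report for these two batsmen, which suggests that the summary is simply the per-batsman sum of partnership runs.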

m<- teamBatsmenPartnershiOppnAllMatches(csk_dd_matches,'Chennai Super Kings',report="summary")
m
## Source: local data frame [29 x 2]
##
##         batsman totalRuns
##          (fctr)     (dbl)
## 1      MS Dhoni       364
## 2      SK Raina       335
## 3       M Vijay       290
## 4   S Badrinath       185
## 5     ML Hayden       181
## 6    MEK Hussey       169
## 7  F du Plessis       100
## 8      S Vidyut        94
## 9      DR Smith        81
## 10    JA Morkel        80
## ..          ...       ...

m<- teamBatsmenPartnershiOppnAllMatches(csk_dd_matches,'Delhi Daredevils',report="summary")
m
## Source: local data frame [53 x 2]
##
##             batsman totalRuns
##              (fctr)     (dbl)
## 1          V Sehwag       233
## 2         G Gambhir       200
## 3         DA Warner       134
## 4    AB de Villiers       133
## 5        KD Karthik       129
## 6  DPMD Jayawardene        89
## 7         JA Morkel        81
## 8        TM Dilshan        79
## 9          S Dhawan        78
## 10          SS Iyer        77
## ..              ...       ...

m <-teamBatsmenPartnershiOppnAllMatches(dc_kkr_matches,'Deccan Chargers',report="summary")
m
## Source: local data frame [29 x 2]
##
##            batsman totalRuns
##             (fctr)     (dbl)
## 1     AC Gilchrist       166
## 2         HH Gibbs       145
## 3        RG Sharma       116
## 4         S Dhawan       111
## 5        A Symonds       100
## 6  Y Venugopal Rao        92
## 7         B Chipli        60
## 8     DB Ravi Teja        54
## 9         TL Suman        53
## 10      VVS Laxman        32
## ..             ...       ...
m <-teamBatsmenPartnershiOppnAllMatches(mi_pw_matches,'Mumbai Indians',report="detailed")
m[1:30,]
##         batsman   nonStriker partnershipRuns totalRuns
## 1  SR Tendulkar JEC Franklin              24       152
## 2  SR Tendulkar    AT Rayudu              46       152
## 3  SR Tendulkar    RG Sharma               2       152
## 4  SR Tendulkar   KD Karthik              20       152
## 5  SR Tendulkar   RT Ponting              39       152
## 6  SR Tendulkar  AC Blizzard              12       152
## 7  SR Tendulkar  RJ Peterson               9       152
## 8     RG Sharma SR Tendulkar               3       135
## 9     RG Sharma JEC Franklin               0       135
## 10    RG Sharma    AT Rayudu              34       135
## 11    RG Sharma    A Symonds              19       135
## 12    RG Sharma   KD Karthik              19       135
## 13    RG Sharma   KA Pollard              47       135
## 14    RG Sharma     TL Suman               7       135
## 15    RG Sharma   GJ Maxwell               6       135
## 16   KD Karthik SR Tendulkar               8       108
## 17   KD Karthik JEC Franklin              32       108
## 18   KD Karthik    AT Rayudu               3       108
## 19   KD Karthik    RG Sharma              50       108
## 20   KD Karthik   SL Malinga              10       108
## 21   KD Karthik      PP Ojha               0       108
## 22   KD Karthik  RJ Peterson               4       108
## 23   KD Karthik  NLTC Perera               1       108
## 24    AT Rayudu SR Tendulkar              54        92
## 25    AT Rayudu    RG Sharma              37        92
## 26    AT Rayudu   KD Karthik               1        92
## 27 JEC Franklin SR Tendulkar              31        63
## 28 JEC Franklin    RG Sharma               1        63
## 29 JEC Franklin   KD Karthik              15        63
## 30 JEC Franklin     SA Yadav              10        63

m <-teamBatsmenPartnershiOppnAllMatches(rr_sh_matches,'Sunrisers Hyderabad',report="summary")
m
## Source: local data frame [23 x 2]
##
##         batsman totalRuns
##          (fctr)     (dbl)
## 1      S Dhawan       168
## 2     DJG Sammy        95
## 3    EJG Morgan        90
## 4     DA Warner        83
## 5       NV Ojha        50
## 6      KL Rahul        40
## 7     RS Bopara        40
## 8      DW Steyn        31
## 9      CL White        31
## 10 MC Henriques        29
## ..          ...       ...

m <-teamBatsmenPartnershiOppnAllMatches(kxip_rcb_matches,'Kings XI Punjab',report="summary")
m
## Source: local data frame [47 x 2]
##
##          batsman totalRuns
##           (fctr)     (dbl)
## 1       SE Marsh       246
## 2      DA Miller       224
## 3      RS Bopara       203
## 4   AC Gilchrist       191
## 5   Yuvraj Singh       126
## 6       MS Bisla       103
## 7  Mandeep Singh       100
## 8      DJ Hussey        99
## 9  Azhar Mahmood        96
## 10 KC Sangakkara        88
## ..           ...       ...
m <-teamBatsmenPartnershiOppnAllMatches(csk_ktk_matches,'Kochi Tuskers Kerala',report="summary")
m
## Source: local data frame [8 x 2]
##
##            batsman totalRuns
##             (fctr)     (dbl)
## 1      BB McCullum        80
## 2         BJ Hodge        70
## 3         PA Patel        40
## 4        RA Jadeja        35
## 5 Y Gnaneswara Rao        19
## 6 DPMD Jayawardene        16
## 7          OA Shah         3
## 8        KM Jadhav         1

6. Team batsmen partnership in Twenty20 (all matches with opposing IPL team)

The partnerships are plotted in the charts below. Note: All functions which create a plot also include a parameter plot=TRUE/FALSE. If you set this to FALSE, a data frame is returned instead. You can use the data frame to create an interactive plot of the partnerships (with mouse-over) using packages like plotly, rCharts, googleVis or ggvis.

teamBatsmenPartnershipOppnAllMatchesChart(csk_dd_matches,'Chennai Super Kings',"Delhi Daredevils")

teamBatsmenPartnershipOppnAllMatchesChart(dc_kkr_matches,main="Kolkata Knight Riders",opposition="Deccan Chargers")

teamBatsmenPartnershipOppnAllMatchesChart(kxip_rcb_matches,"Royal Challengers Bangalore",opposition="Kings XI Punjab")

teamBatsmenPartnershipOppnAllMatchesChart(mi_pw_matches,"Mumbai Indians","Pune Warriors")

m <- teamBatsmenPartnershipOppnAllMatchesChart(rr_sh_matches,"Rajasthan Royals","Sunrisers Hyderabad",plot=FALSE)
m[1:30,]
##        batsman  nonStriker runs
## 1    SR Watson   STR Binny   60
## 2    AM Rahane   STR Binny   59
## 3    STR Binny   AM Rahane   45
## 4    SR Watson    R Dravid   42
## 5    AM Rahane   SV Samson   41
## 6     BJ Hodge   SV Samson   36
## 7    CH Morris   STR Binny   34
## 8    AM Rahane   SR Watson   31
## 9     R Dravid   SR Watson   30
## 10   SV Samson   AM Rahane   29
## 11   SR Watson   AM Rahane   27
## 12   SPD Smith    DJ Hooda   25
## 13   SPD Smith JP Faulkner   24
## 14   SPD Smith   STR Binny   20
## 15    R Dravid   AM Rahane   18
## 16    BJ Hodge JP Faulkner   18
## 17 JP Faulkner   SPD Smith   18
## 18   SV Samson     KK Nair   14
## 19 JP Faulkner   STR Binny   14
## 20   SV Samson   STR Binny   13
## 21   SPD Smith   AM Rahane   13
## 22   SR Watson   SPD Smith   12
## 23   STR Binny JP Faulkner   12
## 24   STR Binny   SPD Smith   12
## 25 JP Faulkner   SV Samson   12
## 26     KK Nair   SV Samson   12
## 27 JP Faulkner    BJ Hodge   11
## 28   SPD Smith   SR Watson   10
## 29   STR Binny   SR Watson    9
## 30   SV Samson    BJ Hodge    9

7. Team batsmen versus bowler in Twenty20 (all matches with opposing IPL team)

The plots below provide information on how each of the top batsmen of the IPL teams fared against the opposition bowlers

# Adam Gilchrist was the top performer for Deccan Chargers
teamBatsmenVsBowlersOppnAllMatches(dc_kkr_matches,"Deccan Chargers","Kolkata Knight Riders")

teamBatsmenVsBowlersOppnAllMatches(csk_dd_matches,"Delhi Daredevils","Chennai Super Kings",top=3)

m <- teamBatsmenVsBowlersOppnAllMatches(csk_ktk_matches,"Chennai Super Kings","Kochi Tuskers Kerala",top=10,plot=FALSE)
m
## Source: local data frame [37 x 3]
## Groups: batsman [1]
##
##     batsman         bowler  runs
##      (fctr)         (fctr) (dbl)
## 1  SK Raina       RP Singh     6
## 2  SK Raina    S Sreesanth    18
## 3  SK Raina M Muralitharan     1
## 4  SK Raina  R Vinay Kumar     4
## 5  SK Raina    NLTC Perera    11
## 6  SK Raina       RR Powar    13
## 7  SK Raina       RV Gomez    16
## 8   WP Saha       RP Singh    15
## 9   WP Saha M Muralitharan    11
## 10  WP Saha       BJ Hodge     1
## ..      ...            ...   ...

teamBatsmenVsBowlersOppnAllMatches(rr_sh_matches,"Sunrisers Hyderabad","Rajasthan Royals")

8. Team batsmen versus bowler in Twenty20(all matches with opposing IPL team)

The following tables give the overall performances of the IPL team’s batsmen against the opposition.

#Chris Gayle followed by Virat Kohli tops for RCB
a <-teamBattingScorecardOppnAllMatches(kxip_rcb_matches,main="Royal Challengers Bangalore",opposition="Kings XI Punjab")
## Total= 2444
a
## Source: local data frame [55 x 5]
##
##           batsman ballsPlayed fours sixes  runs
##            (fctr)       (int) (int) (int) (dbl)
## 1        CH Gayle         313    45    41   561
## 2         V Kohli         296    39     8   344
## 3  AB de Villiers         183    23    16   301
## 4       JH Kallis         133    18     7   187
## 5        R Dravid          90    11     1   105
## 6      RV Uthappa          47     7     6    92
## 7       CA Pujara          66    11    NA    70
## 8       MK Pandey          50     5     3    67
## 9    KP Pietersen          43     7     1    66
## 10     MV Boucher          36     4     1    41
## ..            ...         ...   ...   ...   ...

#Tendulkar & Rohit Sharma lead for Mumbai Indians
teamBattingScorecardOppnAllMatches(mi_pw_matches,"Mumbai Indians","Pune Warriors")
## Total= 756
## Source: local data frame [20 x 5]
##
##            batsman ballsPlayed fours sixes  runs
##             (fctr)       (int) (int) (int) (dbl)
## 1     SR Tendulkar         134    21     1   152
## 2        RG Sharma         121     7     6   135
## 3       KD Karthik         107    10     3   108
## 4        AT Rayudu          93     8     1    92
## 5     JEC Franklin          70     5     2    63
## 6       KA Pollard          43     3     3    55
## 7         TL Suman          16     3     3    36
## 8  Harbhajan Singh          22     3     1    29
## 9       SL Malinga          16     2     1    19
## 10       A Symonds          18     2    NA    19
## 11      RT Ponting          17     2    NA    14
## 12      GJ Maxwell           7     1     1    13
## 13     RJ Peterson          13     1    NA    13
## 14     AC Blizzard           6     1    NA     6
## 15         PP Ojha           2    NA    NA     1
## 16        MM Patel           2    NA    NA     1
## 17         RE Levi           2    NA    NA     0
## 18        SA Yadav           4    NA    NA     0
## 19     NLTC Perera           4    NA    NA     0
## 20        DR Smith           1    NA    NA     0

teamBattingScorecardOppnAllMatches(mi_pw_matches,"Pune Warriors","Mumbai Indians")
## Total= 714
## Source: local data frame [28 x 5]
##
##         batsman ballsPlayed fours sixes  runs
##          (fctr)       (int) (int) (int) (dbl)
## 1    RV Uthappa         131    13     4   151
## 2     MK Pandey          80     5     4    88
## 3  Yuvraj Singh          62     3     6    77
## 4      M Manhas          36     5    NA    42
## 5     SPD Smith          38     4    NA    41
## 6      MR Marsh          26     2     2    38
## 7      M Kartik          21     2     1    25
## 8      R Sharma          22     2     1    23
## 9      TL Suman          15     5    NA    23
## 10   WD Parnell          24     3    NA    22
## ..          ...         ...   ...   ...   ...
teamBattingScorecardOppnAllMatches(csk_dd_matches,"Delhi Daredevils","Chennai Super Kings") ## Total= 1983 ## Source: local data frame [53 x 5] ## ## batsman ballsPlayed fours sixes runs ## (fctr) (int) (int) (int) (dbl) ## 1 V Sehwag 147 27 9 233 ## 2 G Gambhir 155 23 2 200 ## 3 DA Warner 130 11 2 134 ## 4 AB de Villiers 80 7 6 133 ## 5 KD Karthik 99 15 1 129 ## 6 DPMD Jayawardene 77 7 2 89 ## 7 JA Morkel 63 8 2 81 ## 8 TM Dilshan 65 8 3 79 ## 9 S Dhawan 58 8 2 78 ## 10 SS Iyer 56 11 1 77 ## .. ... ... ... ... ... teamBattingScorecardOppnAllMatches(rr_sh_matches,"Rajasthan Royals","Sunrisers Hyderabad") ## Total= 808 ## Source: local data frame [17 x 5] ## ## batsman ballsPlayed fours sixes runs ## (fctr) (int) (int) (int) (dbl) ## 1 SR Watson 97 22 4 148 ## 2 AM Rahane 145 17 1 148 ## 3 SPD Smith 81 11 2 103 ## 4 STR Binny 83 6 1 90 ## 5 SV Samson 83 3 4 76 ## 6 JP Faulkner 41 7 2 59 ## 7 BJ Hodge 37 2 5 55 ## 8 R Dravid 44 7 1 48 ## 9 CH Morris 11 2 3 34 ## 10 KK Nair 23 3 NA 17 ## 11 R Bhatia 10 1 NA 8 ## 12 DS Kulkarni 6 1 NA 7 ## 13 DJ Hooda 9 NA NA 7 ## 14 AM Nayar 3 1 NA 4 ## 15 PV Tambe 7 NA NA 3 ## 16 KW Richardson 2 NA NA 1 ## 17 DH Yagnik 4 NA NA 0 9. Team performances of IPL bowlers (all matches with opposing IPL team)

Like the function above, the following tables list the top IPL bowlers of the respective teams in the matches against the opposition.

#Piyush Chawla has the most wickets for KXIP against RCB
teamBowlingPerfOppnAllMatches(kxip_rcb_matches,"Kings XI Punjab","Royal Challengers Bangalore")
## Source: local data frame [38 x 5]
##
##    bowler           overs maidens  runs wickets
##    (fctr)           (int)   (int) (dbl)   (dbl)
##  1 PP Chawla           14       0   311      12
##  2 IK Pathan           12       0   159       9
##  3 YA Abdulla           9       1   103       8
##  4 RJ Harris            5       0    87       7
##  5 P Awana             11       0   149       6
##  6 S Sreesanth          6       0   101       5
##  7 Azhar Mahmood        8       0    74       5
##  8 Sandeep Sharma       8       1   101       4
##  9 AR Patel             5       0    94       4
## 10 VRV Singh            6       0    70       4
## .. ...                ...     ...   ...     ...

#Ashwin is the highest wicket-taker for CSK against DD
teamBowlingPerfOppnAllMatches(csk_dd_matches,main="Chennai Super Kings",opposition="Delhi Daredevils")
## Source: local data frame [26 x 5]
##
##    bowler           overs maidens  runs wickets
##    (fctr)           (int)   (int) (dbl)   (dbl)
##  1 R Ashwin             9       0   233      17
##  2 JA Morkel           11       0   338      10
##  3 DJ Bravo             5       0   135       8
##  4 SB Jakati            4       0   140       6
##  5 L Balaji            10       0   117       6
##  6 MM Sharma            1       0    99       6
##  7 RA Jadeja            2       0    85       4
##  8 IC Pandey            1       0    80       4
##  9 BW Hilfenhaus        5       0    53       4
## 10 A Nehra              1       0    25       4
## .. ...                ...     ...   ...     ...

teamBowlingPerfOppnAllMatches(dc_kkr_matches,"Deccan Chargers","Kolkata Knight Riders")
## Source: local data frame [26 x 5]
##
##    bowler           overs maidens  runs wickets
##    (fctr)           (int)   (int) (dbl)   (dbl)
##  1 RP Singh            11       0   161       7
##  2 PP Ojha             11       0   196       6
##  3 WPUJC Vaas           4       0    67       5
##  4 A Symonds           12       0   100       4
##  5 DW Steyn             8       0    88       4
##  6 A Mishra             8       0    68       3
##  7 Jaskaran Singh       6       0    53       3
##  8 SB Styris            7       0    79       2
##  9 RJ Harris            4       0    20       2
## 10 Harmeet Singh       10       0    84       1
## .. ...                ...     ...   ...     ...

10. Team bowler’s wickets in IPL Twenty20 (all matches with opposing IPL team)

This function provides a graphical plot of the tables above.

# Dirk Nannes and Umesh Yadav top for DD against CSK
teamBowlersWicketsOppnAllMatches(csk_dd_matches,"Delhi Daredevils","Chennai Super Kings")

# SL Malinga and Munaf Patel lead in MI vs PW clashes
teamBowlersWicketsOppnAllMatches(mi_pw_matches,"Mumbai Indians","Pune Warriors")

teamBowlersWicketsOppnAllMatches(dc_kkr_matches,"Kolkata Knight Riders","Deccan Chargers",top=10)

m <-teamBowlersWicketsOppnAllMatches(kxip_rcb_matches,"Royal Challengers Bangalore","Kings XI Punjab",plot=FALSE)
m
## Source: local data frame [20 x 2]
##
##    bowler           wickets
##    (fctr)             (int)
##  1 S Aravind              8
##  2 Z Khan                 7
##  3 MA Starc               7
##  4 HV Patel               6
##  5 P Kumar                5
##  6 YS Chahal              5
##  7 JH Kallis              4
##  8 R Vinay Kumar          3
##  9 A Kumble               3
## 10 CH Gayle               3
## 11 AB McDonald            3
## 12 VR Aaron               3
## 13 DW Steyn               2
## 14 CK Langeveldt          2
## 15 DL Vettori             2
## 16 M Kartik               2
## 17 RE van der Merwe       2
## 18 R Rampaul              1
## 19 JA Morkel              1
## 20 AB Dinda               1

11. Team bowler vs batsmen in Twenty20 (all matches with opposing IPL team)

These plots show how the IPL bowlers fared against the batsmen, and which of the opposing IPL team’s batsmen were able to score the most runs.

teamBowlersVsBatsmenOppnAllMatches(rr_sh_matches,'Rajasthan Royals',"Sunrisers Hyderabad",top=5)

teamBowlersVsBatsmenOppnAllMatches(kxip_rcb_matches,"Kings XI Punjab","Royal Challengers Bangalore",top=3)

teamBowlersVsBatsmenOppnAllMatches(dc_kkr_matches,"Deccan Chargers","Kolkata Knight Riders")

12. Team bowler’s wicket kind in Twenty20 (caught, bowled, etc.) (all matches with opposing IPL team)

The charts below show the kinds of wickets taken by the bowlers of the IPL team (caught, bowled, lbw, etc.).

teamBowlersWicketKindOppnAllMatches(csk_dd_matches,"Delhi Daredevils","Chennai Super Kings",plot=TRUE)

m <- teamBowlersWicketKindOppnAllMatches(mi_pw_matches,"Pune Warriors","Mumbai Indians",plot=FALSE)
m[1:30,]
##    bowler       wicketKind wicketPlayerOut runs
## 1  SB Wagh      caught     JEC Franklin      31
## 2  R Sharma     caught     SR Tendulkar      64
## 3  AC Thomas    caught     AT Rayudu         69
## 4  M Kartik     stumped    RE Levi           70
## 5  AB Dinda     caught     AT Rayudu        150
## 6  AB Dinda     caught     RG Sharma        150
## 7  M Kartik     stumped    KD Karthik        70
## 8  MN Samuels   bowled     SA Yadav          21
## 9  R Sharma     bowled     KA Pollard        64
## 10 AB Dinda     caught     JEC Franklin     150
## 11 WD Parnell   caught     SL Malinga        64
## 12 AB Dinda     lbw        Harbhajan Singh  150
## 13 Yuvraj Singh caught     RT Ponting        61
## 14 AJ Finch     caught     SR Tendulkar      11
## 15 MR Marsh     lbw        KD Karthik        24
## 16 AC Thomas    caught     AC Blizzard       69
## 17 Yuvraj Singh caught     SR Tendulkar      61
## 18 Yuvraj Singh caught     AT Rayudu         61
## 19 R Sharma     caught     RG Sharma         64
## 20 R Sharma     caught     TL Suman          64
## 21 JE Taylor    caught     A Symonds         34
## 22 JE Taylor    caught     KA Pollard        34
## 23 B Kumar      caught     JEC Franklin      50
## 24 MJ Clarke    run out    RG Sharma          9
## 25 A Nehra      caught     SR Tendulkar      19
## 26 A Nehra      caught     RJ Peterson       19
## 27 B Kumar      bowled     AT Rayudu         50
## 28 A Nehra      run out    NLTC Perera       19
## 29 AB Dinda     caught     Harbhajan Singh  150
## 30 WD Parnell   run out    SL Malinga        64

teamBowlersWicketKindOppnAllMatches(dc_kkr_matches,"Kolkata Knight Riders",'Deccan Chargers',plot=TRUE)

13. Team bowler’s wicket taken and runs conceded in Twenty20 (all matches with opposing IPL team)

teamBowlersWicketRunsOppnAllMatches(csk_ktk_matches,"Kochi Tuskers Kerala","Chennai Super Kings")

m <-teamBowlersWicketRunsOppnAllMatches(mi_pw_matches,"Mumbai Indians","Pune Warriors",plot=FALSE)
m[1:30,]
## Source: local data frame [30 x 5]
##
##    bowler           overs maidens  runs wickets
##    (fctr)           (int)   (int) (dbl)   (dbl)
##  1 AG Murtaza           4       0    18       2
##  2 SL Malinga           9       1   143      10
##  3 AN Ahmed             5       0    40       4
##  4 MM Patel             6       1    88       7
##  5 KA Pollard           6       0    99       5
##  6 JEC Franklin         4       0    64       1
##  7 Harbhajan Singh      7       0    85       6
##  8 PP Ojha              8       0    95       4
##  9 MG Johnson           5       0    41       4
## 10 R Dhawan             1       0    27       0
## .. ...                ...     ...   ...     ...

14. Plot of wins vs losses between teams in IPL T20 confrontations

setwd("C:/software/cricket-package/york-test/yorkrData/IPL/IPL-T20-matches")
plotWinLossBetweenTeams("Chennai Super Kings","Delhi Daredevils")

plotWinLossBetweenTeams("Deccan Chargers","Kolkata Knight Riders",".")

plotWinLossBetweenTeams('Kings XI Punjab',"Royal Challengers Bangalore",".")

plotWinLossBetweenTeams("Mumbai Indians","Pune Warriors",".")

plotWinLossBetweenTeams('Rajasthan Royals',"Sunrisers Hyderabad",".")

plotWinLossBetweenTeams('Chennai Super Kings',"Mumbai Indians",".")

Conclusion

This post covered all the functions for analyzing the IPL Twenty20 matches between any 2 IPL teams. As before, the data frames are already available, so you can load the data and begin to use them. If more insights can be drawn from the dataframes, do go ahead. But please do attribute the source to Cricsheet (http://cricsheet.org), my package yorkr and my blog. Do give the functions a spin for yourself!

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Categories: Methodology Blogs

R editor improvements for the next release of Bio7

Wed, 2016-04-20 05:32

(This article was first published on R, and kindly contributed to R-bloggers)

20.04.2016

For the upcoming release of Bio7 I worked hard to improve the R editor features. So I added some new features and improvements to assist in the creation of R scripts in Bio7.
One of the highlights is the newly integrated dynamic code analysis when writing an R script.

Here is a short overview of the new R editor features I have integrated so far:

  • Detect and display unused variables and functions

  • Detect missing functions and variables
  • Added a new code assist list when triggered in function calls

  • Check of function arguments

  • Check of wrong function argument

  • Available help for mistyped functions (% similarity)

  • Improved Code Completion in general
  • Added a toolbar with two HTML help actions to the context help dialog (if you hover over a method)

  • Improved Code Completion to list local scope self defined variables and functions

  • Added a refactor action to extract variables
  • Added a refactor action to extract functions

  • Added more Quickfixes
  • Quickfixes can now be opened by hovering over a problem or error marker
  • Added an automatic close action of parentheses, brackets, braces and strings in the editor
  • Improved the general parsing speed
  • Added new key shortcuts to faster perform R editor actions
  • New action and key shortcut to open the plot preferences faster
  • Added new on/off preferences for the new features
  • Improved the display of the Outline view for variables and functions

There is of course some room for improvement, and there are some rough edges in the implementation of the dynamic code analysis, since R is a highly dynamic language. However, I hope that these features will help in the creation of correct R scripts in the R editor of the next Bio7 release.

To leave a comment for the author, please follow the link and comment on their blog: R.

Categories: Methodology Blogs

Data Exploration with Tables exercises

Wed, 2016-04-20 04:10

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The table() function is intended for use during the Data Exploration phase of Data Analysis. The table() function performs categorical tabulation of data. In the R programming language, “categorical” variables are also called “factor” variables.

The tabulation of data categories allows for cross-validation of data, thereby revealing possible flaws within a dataset or within the processes used to create it. The table() function also accepts logical arguments that modify how the data are tabulated.

Beyond Data Exploration, the table() function allows for the inference of statistics within multivariate tables (or contingency tables) of two or more variables.

Answers to the exercises are available here.

Exercise 1

Basic tabulation of categorical data

This is the first dataset to explore:
Gender <- c("Female","Female","Male","Male")
Restaurant <- c("Yes","No","Yes","No")
Count <- c(220, 780, 400, 600)
DiningSurvey <- data.frame(Gender, Restaurant, Count)
DiningSurvey

Using the table() function, compare the Gender and Restaurant variables in the above dataset.
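As a minimal sketch of one way to approach this exercise (not necessarily the official answer): table() counts rows, so each Gender/Restaurant pair shows up exactly once here. The xtabs() call is an addition of mine, included only to show how the Count column could be summed within each pair instead.

Gender <- c("Female", "Female", "Male", "Male")
Restaurant <- c("Yes", "No", "Yes", "No")
Count <- c(220, 780, 400, 600)
DiningSurvey <- data.frame(Gender, Restaurant, Count)

# table() tabulates row occurrences: every Gender/Restaurant pair occurs once
pair_counts <- table(DiningSurvey$Gender, DiningSurvey$Restaurant)
pair_counts

# xtabs() sums Count within each Gender/Restaurant pair
weighted_counts <- xtabs(Count ~ Gender + Restaurant, data = DiningSurvey)
weighted_counts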

Exercise 2

The table() function modified with a logical vector.

Use the logical vector of “Count > 650” to summarize the data.

Exercise 3

The useNA argument and the is.na() function help find missing values.

First append the dataset with missing values:
DiningSurvey$Restaurant <- c("Yes", "No", "Yes", NA)

Apply the “useNA” argument to find missing Restaurant data.

Next, apply the “is.na()” function to find missing Restaurant data by Gender.
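A possible sketch for both steps, rebuilding the vectors so that it runs on its own. The choice of useNA = "ifany" is an assumption; "always" would also show an NA column when nothing is missing.

Gender <- c("Female", "Female", "Male", "Male")
Restaurant <- c("Yes", "No", "Yes", NA)

# useNA = "ifany" adds an NA category only when missing values are present
restaurant_counts <- table(Restaurant, useNA = "ifany")
restaurant_counts

# is.na() returns TRUE/FALSE, which table() can cross-tabulate against Gender
missing_by_gender <- table(Gender, Missing = is.na(Restaurant))
missing_by_gender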

Exercise 4

The “exclude =” parameter removes the named values (factor levels) from the tabulation.

Exclude one of the dataset’s Genders with the “exclude” argument.

Exercise 5

The “margin.table()” function requires data in array form, and generates tables of marginal frequencies. The margin.table() function summarizes arrays within a given index.

First, generate array format data:
RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19),ncol=3,byrow=TRUE)
colnames(RentalUnits) <- c("Section1","Section2","Section3")
rownames(RentalUnits) <- c("Rented","Vacant","Reserved")
RentalUnits <- as.table(RentalUnits)

Find the amount of Occupancy summed over Sections.

Next, find the amount of Units summed by Section.
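One way these two margins could be computed, shown as a sketch rather than the official answer. The second argument of margin.table() is the index to keep: 1 keeps the rows (occupancy status), 2 keeps the columns (sections).

RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19), ncol = 3, byrow = TRUE)
colnames(RentalUnits) <- c("Section1", "Section2", "Section3")
rownames(RentalUnits) <- c("Rented", "Vacant", "Reserved")
RentalUnits <- as.table(RentalUnits)

# Totals per occupancy status, summed over the three sections
by_occupancy <- margin.table(RentalUnits, 1)
by_occupancy

# Totals per section, summed over the occupancy statuses
by_section <- margin.table(RentalUnits, 2)
by_section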

Exercise 6

The prop.table() function creates tables of proportions within the dataset.

Use the “prop.table()” function to create a basic table of proportions.

Next, find row percentages, and column percentages.
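A sketch of how this might look, assuming the exercise refers to the “RentalUnits” table from Exercise 5 (the text does not say so explicitly). Multiplying by 100 turns proportions into percentages.

RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19), ncol = 3, byrow = TRUE)
colnames(RentalUnits) <- c("Section1", "Section2", "Section3")
rownames(RentalUnits) <- c("Rented", "Vacant", "Reserved")
RentalUnits <- as.table(RentalUnits)

# Each cell as a share of the grand total
overall <- prop.table(RentalUnits)

# Row percentages: each row sums to 100
row_pct <- 100 * prop.table(RentalUnits, 1)

# Column percentages: each column sums to 100
col_pct <- 100 * prop.table(RentalUnits, 2)

round(row_pct, 1)
round(col_pct, 1)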

Exercise 7

The ftable() function generates multidimensional n-way tables, or “flat” contingency tables.

Use the ftable() function to summarize the dataset, “RentalUnits”.

Exercise 8

The “summary()” function performs an independence test of the dataset’s factors.

Use “summary()” to perform a Chi-Square Test of Independence.
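A sketch of the call, again assuming the “RentalUnits” table is the intended input. For a two-way table, summary() reports the number of cases and a chi-square test of independence; with a 3 x 3 table the test has (3-1) x (3-1) = 4 degrees of freedom.

RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19), ncol = 3, byrow = TRUE)
colnames(RentalUnits) <- c("Section1", "Section2", "Section3")
rownames(RentalUnits) <- c("Rented", "Vacant", "Reserved")
RentalUnits <- as.table(RentalUnits)

# summary() on a table object runs the chi-square test of independence
independence <- summary(RentalUnits)
independence

# The pieces of the result can also be read individually
independence$n.cases    # number of observations in the table
independence$parameter  # degrees of freedom
independence$p.value    # p-value of the test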

Exercise 9

“as.data.frame()” summarizes frequencies of data arrays.

Use “as.data.frame()” to list frequencies within the “RentalUnits” array.
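A possible sketch: converting the table gives one row per cell. Because “RentalUnits” was built from an unnamed matrix, the two index columns get the default names Var1 and Var2, and the counts land in Freq.

RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19), ncol = 3, byrow = TRUE)
colnames(RentalUnits) <- c("Section1", "Section2", "Section3")
rownames(RentalUnits) <- c("Rented", "Vacant", "Reserved")
RentalUnits <- as.table(RentalUnits)

# One row per (occupancy status, section) combination
rental_long <- as.data.frame(RentalUnits)
rental_long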

Exercise 10

The “addmargins()” function creates arbitrary margins on multivariate arrays.

Use “addmargins()” to append “RentalUnits” with sums.

Next, summarize columns with “RentalUnits”.

Next, summarize rows with “RentalUnits”.

Finally, combine “addmargins()” and “prop.table()” to summarize proportions within “RentalUnits”. What is statistically inferred about sales of rental units by section?
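The four steps could be sketched as follows (one reading of the exercise, not the official answer). Note that the margin argument names the dimension that gets extended: margin 1 appends a "Sum" row holding the column totals, margin 2 appends a "Sum" column holding the row totals.

RentalUnits <- matrix(c(45,37,34,10,15,12,24,18,19), ncol = 3, byrow = TRUE)
colnames(RentalUnits) <- c("Section1", "Section2", "Section3")
rownames(RentalUnits) <- c("Rented", "Vacant", "Reserved")
RentalUnits <- as.table(RentalUnits)

# Sums on both margins
with_sums <- addmargins(RentalUnits)

# Column totals only (extra "Sum" row)
col_totals <- addmargins(RentalUnits, 1)

# Row totals only (extra "Sum" column)
row_totals <- addmargins(RentalUnits, 2)

# Proportions of the grand total, with marginal shares appended
shares <- addmargins(prop.table(RentalUnits))
round(shares, 3)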

Image by IngerAlHaosului.

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Categories: Methodology Blogs

Le Monde puzzle [#959]

Tue, 2016-04-19 18:16

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Another of those arithmetic Le Monde mathematical puzzles:

Find an integer A such that A is the sum of the squares of its four smallest divisors (including 1) and an integer B such that B is the sum of the third powers of its four smallest divisors. Are there such integers for higher powers?

This begs for a brute force resolution checking the integers until a solution appears. The only exciting part is providing the four smallest divisors, but a search on Stack Overflow led to an existing R function:

FUN <- function(x) {
  x <- as.integer(x)
  div <- seq_len(abs(x))
  return(div[x %% div == 0L])
}

(which uses the 0L representation I was unaware of) and hence my R code:

quest1<-function(n=2){
  I=4
  stop=TRUE
  while ((stop)&(I<1e6)){
    I=I+1
    dive=FUN(I)
    if (length(dive)>3)
      stop=(I!=sum(sort(dive)[1:4]^n))
  }
  return(I)
}

But this code only seems to work for n=2, where it produces A=130: it does not return any solution for the next value of n… As shown by the picture below, which exhibits solutions solely for n=2 and n=5 (A=17864 in the second case), there is no solution less than 10⁶ for n=3,4,6,…,9. So, unless I missed a point in the question, the solutions for the other values of n are larger, if they exist at all.

A resolution got published yesterday night in Le Monde and (i) there is indeed no solution for n=3 (!), (ii) there are solutions for n=4 (1,419,874) and n=6 (1,015,690), which are larger than the 10⁶ bound I used in the R code, (iii) there is supposedly no solution for n=5!, when the R code found that 17,864=1⁵+2⁵+4⁵+7⁵… It is far from the first time the solution is wrong or incomplete!
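The values quoted in this post can be checked directly. The sketch below uses two helper names of my own (smallest_divisors and is_solution, neither from the original post) and the same trial-division idea as FUN above. It confirms 130 for squares, 17,864 for fifth powers and 1,419,874 for fourth powers, and shows that 1,015,690 works as a sum of sixth powers (1⁶+2⁶+5⁶+10⁶).

# Hypothetical helper: the k smallest divisors of x, by trial division
smallest_divisors <- function(x, k = 4) {
  div <- seq_len(x)
  head(div[x %% div == 0L], k)
}

# Hypothetical helper: TRUE when x equals the sum of the n-th powers
# of its four smallest divisors
is_solution <- function(x, n) {
  d <- smallest_divisors(x)
  length(d) == 4 && sum(d^n) == x
}

is_solution(130, 2)
is_solution(17864, 5)
is_solution(1419874, 4)
is_solution(1015690, 6)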

Filed under: Kids, R Tagged: Le Monde, mathematical puzzle, R, Stack Exchange

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Categories: Methodology Blogs

Notes from 2nd Bayesian Mixer Meetup

Tue, 2016-04-19 16:41

(This article was first published on mages' blog, and kindly contributed to R-bloggers)

Last Friday the 2nd Bayesian Mixer Meetup (@BayesianMixer) took place at Cass Business School, thanks to Pietro Millossovich and Andreas Tsanakas, who helped to organise the event.
Bayesian Mixer at Cass

First up was Davide De March talking about the challenges in biochemistry experimentation, which are often characterised by complex and emerging relations among components.

The scarcity of prior knowledge about the bindings of complex molecules leaves a fertile field for probabilistic graphical models. In particular, Bayesian networks can help the investigator define a conditional dependence/independence structure from which a joint multivariate probability distribution is determined. Hence, the use of Bayesian networks can lead to a more efficient way of designing experiments.

Davide De March: Bayesian Networks to design optimal experiments

The second act of the night was Mick Cooney, presenting ideas of using growth curves to estimate the ultimate amounts paid in insurance by some cohort of policies.

The talk showed a model for these curves, discussed the implementation in Stan and how posterior predictive checks can be used to assess the output of the model.

Mick Cooney: Bayesian Modelling for Loss Curves in Insurance

Thanks again to everyone who helped to make the event a success, particularly our speakers and Jon Sedar of Applied AI.

We are planning to run another event in mid-June. Please get in touch via our Meetup site with ideas and talk proposals. The 4th R in Insurance conference will take place on 11 July 2016 at Cass Business School. Send in your abstract by 28 March and register now.

This post was originally published on mages’ blog.

To leave a comment for the author, please follow the link and comment on their blog: mages' blog.

Categories: Methodology Blogs

R’s Growth Continues to Accelerate

Tue, 2016-04-19 11:25

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

Each year I update the growth in R’s capability on The Popularity of Data Analysis Software. And each year, I think R’s incredible rate of growth will finally slow down. Below is a graph of the latest data, and as you can see, R’s growth continues to accelerate.

Since I’ve added coverage for many more software packages, I have restructured the main article to reflect the value of each type of data. They now appear in this order:

  • Job Advertisements
  • Scholarly Articles
  • IT Research Firm Reports
  • Surveys of Use
  • Books
  • Blogs
  • Discussion Forum Activity
  • Programming Popularity Measures
  • Sales & Downloads
  • Competition Use
  • Growth in Capability

Growth in Capability remains last because I only have complete data for R. To save you from having to dig through all 40+ pages of the article, the updated section is below. I’ll be updating several other sections in the coming weeks. If you’re interested, you can follow this blog, or follow me on Twitter as @BobMuenchen.

If you haven’t yet learned R, I recommend my books R for SAS and SPSS Users and R for Stata Users. I do R training as well, but that’s booked up through the end of August, so please plan ahead.

Growth in Capability

The capability of analytics software has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data are hard to obtain. John Fox (2009) acquired them for R’s main distribution site http://cran.r-project.org/ for each version of R. To simplify ongoing data collection, I kept only the values for the last version of R released each year (usually in November or December), and collected data through the most recent complete year.

These data are displayed in Figure 10. The right-most point is for version 3.2.3, released 12/10/2015. The growth curve follows a rapid parabolic arc (quadratic fit with R-squared=.995).

Figure 10. Number of R packages available on its main distribution site for the last version released in each year.

To put this astonishing growth in perspective, let us compare it to the most dominant commercial package, SAS. In version 9.3, SAS contained around 1,200 commands that are roughly equivalent to R functions (procs, functions, etc. in Base, Stat, ETS, HP Forecasting, Graph, IML, Macro, OR, and QC). In 2015, R added 1,357 packages, counting only CRAN, or approximately 27,642 functions. During 2015 alone, R added more functions/procs than SAS Institute has written in its entire history.

Of course while SAS and R commands solve many of the same problems, they are certainly not perfectly equivalent. Some SAS procedures have many more options to control their output than R functions do, so one SAS procedure may be equivalent to many R functions. On the other hand, R functions can nest inside one another, creating nearly infinite combinations. SAS is now out with version 9.4 and I have not repeated the arduous task of recounting its commands. If SAS Institute would provide the figure, I would include it here. While the comparison is far from perfect, it does provide an interesting perspective on the size and growth rate of R.

As rapid as R’s growth has been, these data represent only the main CRAN repository. R has eight other software repositories, such as Bioconductor, that are not included in Fig. 10. A program run on 4/19/2016 counted 11,531 R packages at all major repositories, 8,239 of which were at CRAN. (I excluded the GitHub repository since it contains duplicates to CRAN that I could not easily remove.) So the growth curve for the software at all repositories would be approximately 40% higher on the y-axis than the one shown in Figure 10.

As with any analysis software, individuals also maintain their own separate collections available on their web sites. However, those are not easily counted.

What’s the total number of R functions? The Rdocumentation site shows the latest counts of both packages and functions on CRAN, Bioconductor and GitHub. They indicate that there is an average of 19.78 functions per package. Given the package count of 11,531, as of 4/19/2016 there were approximately 228,103 total functions in R. In total, R has approximately 190 times as many commands as its main commercial competitor, SAS.
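The arithmetic behind these figures can be retraced in a few lines. This is only a sketch using the rounded numbers quoted in the text. With the average rounded to 19.78, the product comes out near 228,083 rather than the quoted 228,103, presumably because the original calculation used an unrounded average (that explanation is my assumption).

packages_all  <- 11531   # packages at all major repositories, 4/19/2016
packages_cran <- 8239    # of which at CRAN
funs_per_pkg  <- 19.78   # average functions per package (rounded)
sas_commands  <- 1200    # approximate SAS 9.3 command count

# How much higher the all-repository curve sits relative to CRAN alone
extra_share <- packages_all / packages_cran - 1

# Estimated total number of R functions
total_functions <- packages_all * funs_per_pkg

# Ratio of R functions to SAS commands
ratio_to_sas <- total_functions / sas_commands

round(extra_share, 2)
round(total_functions)
round(ratio_to_sas)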

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.

Categories: Methodology Blogs

Exploring NYC Taxi Data with Microsoft R Server and HDInsight

Tue, 2016-04-19 10:45

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance.

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
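To make the evaluation setup concrete, here is a small base R sketch of the same workflow shape: a 75% training split, a linear regression, and the correlation between predicted and actual values on the held-out rows. It is not the authors' code. It uses lm() instead of the distributed Microsoft R Server functions, and the trips data frame is simulated with made-up columns (fare, tip) purely for illustration.

set.seed(42)

# Simulated stand-in for the taxi data (hypothetical columns and coefficients)
n <- 1000
trips <- data.frame(fare = runif(n, min = 5, max = 60))
trips$tip <- 0.15 * trips$fare + rnorm(n, sd = 2)

# 75% of the rows for training, the remainder held out for testing
train_idx <- sample(n, size = 0.75 * n)
train <- trips[train_idx, ]
test  <- trips[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
fit  <- lm(tip ~ fare, data = train)
pred <- predict(fit, newdata = test)

# Correlation between predicted and actual tip on the test set
test_cor <- cor(pred, test$tip)
test_cor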

The data behind the analysis is public, so if you'd like to try it out yourself the Microsoft R Server code for the analysis is available on Github, and you can read more details about the analysis in the detailed writeup, linked below. The link also contains details about data exploration and modeling, including references to additional distributed machine learning functions in R, which may be explored to improve model performance.

Scalable Data Analysis using Microsoft R Server (MRS) on Hadoop MapReduce: Using MRS on Azure HDInsight (Premium) for Exploring and Modeling the 2013 New York City Taxi Trip and Fare Data

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Categories: Methodology Blogs

yorkr crashes the IPL party ! – Part 1

Tue, 2016-04-19 10:33

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

 

Where tireless striving stretches its arms towards perfection

Where the clear stream of reason has not lost its way

Into the dreary desert sand of dead habit

Rabindranath Tagore

Introduction

In this post, my R package yorkr crashes the IPL party! In my earlier posts I had already created functions for handling Twenty20 matches. I now use these functions to analyze the IPL T20 matches. This package is based on data from Cricsheet. The T20 functionality was added in the following posts

  1. yorkr pads up for the Twenty20s: Part 1- Analyzing team’s match performance.
  2. yorkr pads up for the Twenty20s: Part 2-Head to head confrontation between teams
  3. yorkr pads up for the Twenty20s:Part 3:Overall team performance against all oppositions!
  4. yorkr pads up for the Twenty20s:Part 4- Individual batting and bowling performances

The yorkr package provides functions to convert the yaml files to more easily R consumable entities, namely dataframes. All converted files for ODI,T20 and IPL are available for use at yorkrData.

The IPL T20 matches can be downloaded from IPL-T20-matches

This post can be viewed at RPubs at yorkrIPLT20-Part1 or can also be downloaded as a PDF document yorkrIPLT20-1.pdf

1. Functions related to team performance in match

The following functions can be used to analyze the IPL team performance in a T20 match

  1. teamBattingScorecardMatch()
  2. teamBatsmenPartnershipMatch()
  3. teamBatsmenVsBowlersMatch()
  4. teamBowlingScorecardMatch()
  5. teamBowlingWicketKindMatch()
  6. teamBowlingWicketRunsMatch()
  7. teamBowlingWicketMatch()
  8. teamBowlersVsBatsmenMatch()
  9. matchWormGraph()
2. Install the package from CRAN

library(yorkr)
rm(list=ls())

2a. New functionality for Twenty20

The functions that were used to convert the Twenty20 yaml files to RData are

  1. convertYaml2RDataframeT20
  2. convertAllYaml2RDataframesT20

Note 1: While I have already converted the IPL T20 files, you will need to use these functions for future IPL matches

Note 2: This post includes some cosmetic changes made over yorkr_0.0.4, where I make the plot title more explicit. The functionality will be available in a few weeks from now in yorkr_0.0.5

3. Convert and save T20 yaml file to dataframe

This function will convert a T20 IPL yaml file, in the format specified by Cricsheet, to a dataframe. This will be saved as an RData file in the target directory. The name of the file will have the following format: team1-team2-date.RData. An example of how a yaml file can be converted to a dataframe and saved is shown below.

convertYaml2RDataframeT20("335982.yaml",".",".")
## [1] "./335982.yaml"
## [1] "first loop"
## [1] "second loop"

4. Convert and save all T20 yaml files to dataframes

This function will convert all IPL T20 yaml files from a source directory to dataframes, and save them in the target directory, with the names as mentioned above. Since I have already done this, I will not be executing this again. You can download the zip of all the converted RData files from Github at IPL-T20-matches

#convertAllYaml2RDataframesT20("./IPL","./data")

5. yorkrData – A Github repository

Cricsheet had a total of 518 IPL Twenty20 matches, of which 9 files seemed to have problems. The remaining 509 T20 matches have been converted to RData.

All the converted RData files can be accessed from my Github link yorkrData under the folder IPL-T20-matches

You can download the zip of the files and use it directly in the functions as follows

6. Load the match data as dataframes

For this post I will be using the IPL Twenty20 match data from 5 random matches between 10 different opposing IPL teams. For this I will directly use the converted RData files rather than getting the data through the getMatchDetails() as shown below

With the RData we can load the data in 2 ways

A. With getMatchDetails(), using the 2 teams and the date on which the match occurred

sh_rcb <- getMatchDetails("Sunrisers Hyderabad","Royal Challengers Bangalore","2014-05-20",dir=".")
dim(sh_rcb)
## [1] 244 25

or

B. Directly load RData into your code.

The match details will be loaded into a dataframe called ’overs’ which you can assign to a suitable name as below
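The reason for the reassignment is that load() always restores objects under the names they were saved with, so every match file brings back an object called overs and would overwrite the previous one. The sketch below illustrates this with a toy data frame and a temporary file. The file name and columns are made up for the example and are not real yorkr data.

# Toy stand-in for a converted match file (hypothetical name and columns)
overs <- data.frame(ball = c("0.1", "0.2"), runs = c(1, 4))
path <- file.path(tempdir(), "TeamA-TeamB-2016-01-01.RData")
save(overs, file = path)
rm(overs)

# load() returns the names of the objects it restored
match_env <- new.env()
restored <- load(path, envir = match_env)
restored

# Copy the restored 'overs' to a descriptive name before loading the next file
team_a_b <- match_env$overs
dim(team_a_b)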

The randomly selected IPL T20 matches are

  • Sunrisers Hyderabad vs Royal Challengers Bangalore, 2014-05-20
  • Rajasthan Royals vs Pune Warriors, 2013-05-05
  • Deccan Chargers vs Chennai Super Kings, 2008-05-27
  • Kings XI Punjab vs Delhi Daredevils, 2014-05-25
  • Kolkata Knight Riders vs Mumbai Indians, 2014-05-14
setwd("C:/software/cricket-package/cricsheet/cleanup/IPL/part1")
load("Sunrisers Hyderabad-Royal Challengers Bangalore-2014-05-20.RData")
sh_rcb <- overs
load("Rajasthan Royals-Pune Warriors-2013-05-05.RData")
rr_pw <- overs
load("Deccan Chargers-Chennai Super Kings-2008-05-27.RData")
dc_csk <- overs
load("Kings XI Punjab-Delhi Daredevils-2014-05-25.RData")
kxp_dd <- overs
load("Kolkata Knight Riders-Mumbai Indians-2014-05-14.RData")
kkr_mi <- overs

7. Team batting scorecard

Compute and display the batting scorecard of the teams in the match.

teamBattingScorecardMatch(kkr_mi,'Mumbai Indians')
## Total= 134
## Source: local data frame [7 x 5]
##
##       batsman ballsPlayed fours sixes  runs
##        (fctr)       (int) (dbl) (dbl) (dbl)
## 1 LMP Simmons          13     2     0    12
## 2   CM Gautam           9     1     0     8
## 3   AT Rayudu          26     3     1    33
## 4   RG Sharma          45     4     2    51
## 5 CJ Anderson          12     1     1    18
## 6  KA Pollard          11     0     0    10
## 7     AP Tare           3     0     0     2

teamBattingScorecardMatch(kkr_mi,'Kolkata Knight Riders')
## Total= 137
## Source: local data frame [5 x 5]
##
##           batsman ballsPlayed fours sixes  runs
##            (fctr)       (int) (dbl) (dbl) (dbl)
## 1      RV Uthappa          52     9     3    80
## 2       G Gambhir          17     1     0    14
## 3       MK Pandey          21     0     0    14
## 4       YK Pathan          13     3     0    20
## 5 Shakib Al Hasan           8     1     0     9

teamBattingScorecardMatch(sh_rcb,'Sunrisers Hyderabad')
## Total= 154
## Source: local data frame [5 x 5]
##
##     batsman ballsPlayed fours sixes  runs
##      (fctr)       (int) (dbl) (dbl) (dbl)
## 1  S Dhawan          39     7     1    50
## 2 DA Warner          43     3     4    59
## 3   NV Ojha          19     0     2    24
## 4  AJ Finch           9     1     0    11
## 5 DJG Sammy           4     0     1    10

teamBattingScorecardMatch(rr_pw,'Pune Warriors')
## Total= 167
## Source: local data frame [5 x 5]
##
##        batsman ballsPlayed fours sixes  runs
##         (fctr)       (int) (int) (dbl) (dbl)
## 1   RV Uthappa          41     8     1    54
## 2     AJ Finch          32     7     0    45
## 3 Yuvraj Singh          11     1     1    15
## 4     MR Marsh          21     2     3    35
## 5   AD Mathews          15     2     0    18

teamBattingScorecardMatch(dc_csk,'Chennai Super Kings')
## Total= 137
## Source: local data frame [5 x 5]
##
##      batsman ballsPlayed fours sixes  runs
##       (fctr)       (int) (int) (dbl) (dbl)
## 1   PA Patel          27     3     0    20
## 2 SP Fleming           9     3     0    14
## 3   SK Raina          41     5     2    54
## 4   MS Dhoni          24     4     1    37
## 5  JA Morkel          12     1     0    12

teamBattingScorecardMatch(kxp_dd,'Kings XI Punjab')
## Total= 104
## Source: local data frame [5 x 5]
##
##      batsman ballsPlayed fours sixes  runs
##       (fctr)       (int) (dbl) (dbl) (dbl)
## 1   V Sehwag           7     2     0     9
## 2    M Vohra          37     4     2    47
## 3 GJ Maxwell           2     0     0     0
## 4  DA Miller          34     4     2    47
## 5  GJ Bailey           1     0     0     1

8. Plot the team batting partnerships

The functions below plot the team batting partnerships in the match.

Note: Many of the plot functions take an additional parameter, plot, which is either TRUE or FALSE. The default value is plot=TRUE, in which case the plot is displayed. When plot=FALSE, the data frame is returned to the user instead. The user can use this to create an interactive chart using one of the packages like rcharts, ggvis, googleVis or plotly.

teamBatsmenPartnershipMatch(kkr_mi,'Mumbai Indians','Kolkata Knight Riders')

teamBatsmenPartnershipMatch(sh_rcb,'Sunrisers Hyderabad','Royal Challengers Bangalore',plot=TRUE)

teamBatsmenPartnershipMatch(rr_pw,'Pune Warriors','Rajasthan Royals')

teamBatsmenPartnershipMatch(dc_csk,'Chennai Super Kings','Deccan Chargers',plot=FALSE)
##      batsman nonStriker runs
## 1   PA Patel SP Fleming   10
## 2   PA Patel   SK Raina   10
## 3 SP Fleming   PA Patel   14
## 4   SK Raina   PA Patel   19
## 5   SK Raina   MS Dhoni   14
## 6   SK Raina  JA Morkel   21
## 7   MS Dhoni   SK Raina   37
## 8  JA Morkel   SK Raina   12

teamBatsmenPartnershipMatch(kxp_dd,'Kings XI Punjab','Delhi Daredevils',plot=TRUE)
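As a rough illustration of what can be done with the data frame returned when plot=FALSE, the sketch below retypes the Chennai Super Kings partnership output printed above and sums the runs per batsman. The variable names are my own, and a static base R bar chart is used here as a stand-in for any of the interactive packages.

```r
# Partnership data retyped from the plot=FALSE output printed above
partnerships <- data.frame(
  batsman    = c("PA Patel", "PA Patel", "SP Fleming", "SK Raina",
                 "SK Raina", "SK Raina", "MS Dhoni", "JA Morkel"),
  nonStriker = c("SP Fleming", "SK Raina", "PA Patel", "PA Patel",
                 "MS Dhoni", "JA Morkel", "SK Raina", "SK Raina"),
  runs       = c(10, 10, 14, 19, 14, 21, 37, 12),
  stringsAsFactors = FALSE
)

# Total runs scored by each batsman across all partners
totals <- tapply(partnerships$runs, partnerships$batsman, sum)

# Static bar chart; an interactive package could be swapped in here
barplot(sort(totals, decreasing = TRUE), las = 2, ylab = "Runs",
        main = "Chennai Super Kings: runs by batsman")
```

The per-batsman totals obtained this way agree with the runs column of the Chennai Super Kings batting scorecard in section 7.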

9. Batsmen vs Bowler

The function below computes and plots the performances of the batsmen against the bowlers. As before, the plot parameter can be set to TRUE or FALSE. The default is plot=TRUE.

teamBatsmenVsBowlersMatch(sh_rcb,"Sunrisers Hyderabad","Royal Challengers Bangalore", plot=TRUE)

teamBatsmenVsBowlersMatch(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')

m <- teamBatsmenVsBowlersMatch(rr_pw,'Pune Warriors','Rajasthan Royals',plot=FALSE)
m
## Source: local data frame [20 x 3]
## Groups: batsman [?]
##
##         batsman      bowler runsConceded
##          (fctr)      (fctr)        (dbl)
## 1    RV Uthappa  A Chandila           12
## 2    RV Uthappa JP Faulkner            1
## 3    RV Uthappa   SR Watson           13
## 4    RV Uthappa   KK Cooper            2
## 5    RV Uthappa  SK Trivedi           18
## 6    RV Uthappa   STR Binny            8
## 7      AJ Finch  A Chandila           11
## 8      AJ Finch JP Faulkner           12
## 9      AJ Finch   SR Watson            5
## 10     AJ Finch   KK Cooper            8
## 11     AJ Finch  SK Trivedi            9
## 12 Yuvraj Singh   KK Cooper            0
## 13 Yuvraj Singh  SK Trivedi            5
## 14 Yuvraj Singh   STR Binny           10
## 15     MR Marsh JP Faulkner           13
## 16     MR Marsh   SR Watson            7
## 17     MR Marsh   KK Cooper           15
## 18   AD Mathews JP Faulkner            7
## 19   AD Mathews   SR Watson            3
## 20   AD Mathews   KK Cooper            8

teamBatsmenVsBowlersMatch(dc_csk,"Chennai Super Kings","Deccan Chargers")

teamBatsmenVsBowlersMatch(kxp_dd,"Kings XI Punjab","Delhi Daredevils")
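The long-format data frame returned with plot=FALSE can be reshaped into a batsman-by-bowler matrix with base R. The sketch below retypes the Pune Warriors vs Rajasthan Royals output printed earlier in this section; the reshaping with xtabs is my own illustration and not something yorkr does for you.

```r
# Batsman vs bowler data retyped from the plot=FALSE output in this section
m <- data.frame(
  batsman = rep(c("RV Uthappa", "AJ Finch", "Yuvraj Singh",
                  "MR Marsh", "AD Mathews"),
                times = c(6, 5, 3, 3, 3)),
  bowler = c("A Chandila", "JP Faulkner", "SR Watson",
             "KK Cooper", "SK Trivedi", "STR Binny",
             "A Chandila", "JP Faulkner", "SR Watson",
             "KK Cooper", "SK Trivedi",
             "KK Cooper", "SK Trivedi", "STR Binny",
             "JP Faulkner", "SR Watson", "KK Cooper",
             "JP Faulkner", "SR Watson", "KK Cooper"),
  runsConceded = c(12, 1, 13, 2, 18, 8,
                   11, 12, 5, 8, 9,
                   0, 5, 10,
                   13, 7, 15,
                   7, 3, 8),
  stringsAsFactors = FALSE
)

# Batsman-by-bowler matrix of runs; pairings that never occurred show as 0
runsMatrix <- xtabs(runsConceded ~ batsman + bowler, data = m)

byBatsman <- rowSums(runsMatrix)  # runs scored by each batsman
byBowler  <- colSums(runsMatrix)  # runs conceded by each bowler
print(runsMatrix)
```

The row sums reproduce the runs column of the Pune Warriors batting scorecard in section 7, which is a quick sanity check on the reshaping.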

10. Bowling Scorecard

This function provides the bowling performance of a team in the match: the number of overs bowled, maidens, runs conceded and wickets taken by each bowler.
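A scorecard of this shape lends itself to simple derived measures. The sketch below retypes the first scorecard printed in this section (the bowlers listed for the Kolkata Knight Riders vs Mumbai Indians match) and adds an economy rate, computed as runs per over. The economy column is my own addition and is not part of the yorkr output; since overs are reported as whole numbers, the figure is approximate whenever a bowler bowled a partial over.

```r
# Bowling scorecard retyped from the first output in this section
scorecard <- data.frame(
  bowler  = c("M Morkel", "UT Yadav", "Shakib Al Hasan",
              "SP Narine", "PP Chawla", "YK Pathan"),
  overs   = c(4L, 3L, 4L, 4L, 4L, 1L),
  maidens = rep(0L, 6),
  runs    = c(35, 24, 21, 18, 32, 10),
  wickets = c(2, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Economy rate: runs conceded per over bowled (assumes whole overs)
scorecard$economy <- scorecard$runs / scorecard$overs

# Most economical bowlers first
ranked <- scorecard[order(scorecard$economy),
                    c("bowler", "overs", "runs", "economy")]
print(ranked)
```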

teamBowlingScorecardMatch(kkr_mi,'Kolkata Knight Riders')
## Source: local data frame [6 x 5]
##
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1        M Morkel     4       0    35       2
## 2        UT Yadav     3       0    24       0
## 3 Shakib Al Hasan     4       0    21       1
## 4       SP Narine     4       0    18       1
## 5       PP Chawla     4       0    32       1
## 6       YK Pathan     1       0    10       0

teamBowlingScorecardMatch(kkr_mi,'Mumbai Indians')
## Source: local data frame [6 x 5]
##
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1      SL Malinga     4       0    30       1
## 2       JJ Bumrah     3       0    23       0
## 3 Harbhajan Singh     4       0    22       2
## 4         PP Ojha     4       0    25       0
## 5     LMP Simmons     3       0    34       1
## 6      KA Pollard     1       0     7       0

teamBowlingScorecardMatch(sh_rcb,"Sunrisers Hyderabad")
## Source: local data frame [7 x 5]
##
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (dbl)
## 1         B Kumar     4       0    27       2
## 2        DW Steyn     4       0    23       1
## 3   Parvez Rasool     4       0    26       1
## 4       KV Sharma     3       0    27       1
## 5 Y Venugopal Rao     1       0     7       0
## 6       IK Pathan     3       0    28       1
## 7       DJG Sammy     1       0    19       0

teamBowlingScorecardMatch(rr_pw,'Pune Warriors')
## Source: local data frame [6 x 5]
##
##         bowler overs maidens  runs wickets
##         (fctr) (int)   (int) (dbl)   (dbl)
## 1      B Kumar     4       0    38       1
## 2   K Upadhyay     3       0    29       0
## 3   WD Parnell     4       0    27       3
## 4     R Sharma     4       0    38       0
## 5 Yuvraj Singh     2       0    16       0
## 6   AD Mathews     3       0    34       1

teamBowlingScorecardMatch(dc_csk,"Chennai Super Kings")
## Source: local data frame [5 x 5]
##
##           bowler overs maidens  runs wickets
##           (fctr) (int)   (int) (dbl)   (int)
## 1        M Ntini     4       0    24       1
## 2        MS Gony     4       0    21       1
## 3      JA Morkel     4       0    37       3
## 4 M Muralitharan     4       0    22       1
## 5       L Balaji     4       0    34       2

teamBowlingScorecardMatch(kxp_dd,"Kings XI Punjab")
## Source: local data frame [5 x 5]
##
##            bowler overs maidens  runs wickets
##            (fctr) (int)   (int) (dbl)   (int)
## 1         P Awana     3       1    15       2
## 2        AR Patel     4       0    28       2
## 3      MG Johnson     4       1    27       2
## 4 Karanveer Singh     4       0    22       2
## 5        R Dhawan     4       0    22       2

11. Wicket Kind

The plots below show the kind of wicket taken by each bowler (caught, bowled, lbw etc.).

teamBowlingWicketKindMatch(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')

m <- teamBowlingWicketKindMatch(rr_pw,'Pune Warriors','Rajasthan Royals',plot=FALSE)
m
##         bowler wicketKind wicketPlayerOut runs
## 1   AD Mathews     caught        R Dravid   34
## 2   WD Parnell     bowled       SR Watson   27
## 3      B Kumar     caught       AM Rahane   38
## 4   WD Parnell     caught        BJ Hodge   27
## 5   WD Parnell     caught       SV Samson   27
## 6   K Upadhyay   noWicket        noWicket   29
## 7     R Sharma   noWicket        noWicket   38
## 8 Yuvraj Singh   noWicket        noWicket   16

teamBowlingWicketKindMatch(dc_csk,"Chennai Super Kings","Deccan Chargers")

teamBowlingWicketKindMatch(kxp_dd,"Kings XI Punjab","Delhi Daredevils",plot=TRUE)

teamBowlingWicketKindMatch(sh_rcb,"Royal Challengers Bangalore","Sunrisers Hyderabad")
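The data frame returned with plot=FALSE can be tabulated directly. The sketch below retypes the Pune Warriors vs Rajasthan Royals output printed earlier in this section and counts dismissals by kind and by bowler. It assumes, as the printed output suggests, that bowlers without a wicket are marked with a "noWicket" row, which is dropped before counting.

```r
# Wicket-kind data retyped from the plot=FALSE output in this section
m <- data.frame(
  bowler = c("AD Mathews", "WD Parnell", "B Kumar", "WD Parnell",
             "WD Parnell", "K Upadhyay", "R Sharma", "Yuvraj Singh"),
  wicketKind = c("caught", "bowled", "caught", "caught",
                 "caught", "noWicket", "noWicket", "noWicket"),
  wicketPlayerOut = c("R Dravid", "SR Watson", "AM Rahane", "BJ Hodge",
                      "SV Samson", "noWicket", "noWicket", "noWicket"),
  runs = c(34, 27, 38, 27, 27, 29, 38, 16),
  stringsAsFactors = FALSE
)

# Keep only actual dismissals
wickets <- subset(m, wicketKind != "noWicket")

kindCounts <- table(wickets$wicketKind)                  # dismissals by kind
byBowler   <- table(wickets$bowler, wickets$wicketKind)  # bowler x kind
print(byBowler)
```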

12. Wicket vs Runs conceded

The plots below provide the wickets taken and the runs conceded by the bowler in the match

teamBowlingWicketRunsMatch(dc_csk,"Deccan Chargers", "Chennai Super Kings")

teamBowlingWicketRunsMatch(kxp_dd,"Kings XI Punjab","Delhi Daredevils",plot=TRUE)

teamBowlingWicketRunsMatch(sh_rcb,"Sunrisers Hyderabad","Royal Challengers Bangalore")

teamBowlingWicketRunsMatch(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')

m <- teamBowlingWicketKindMatch(rr_pw,'Pune Warriors','Rajasthan Royals',plot=FALSE)
m
##         bowler wicketKind wicketPlayerOut runs
## 1   AD Mathews     caught        R Dravid   34
## 2   WD Parnell     bowled       SR Watson   27
## 3      B Kumar     caught       AM Rahane   38
## 4   WD Parnell     caught        BJ Hodge   27
## 5   WD Parnell     caught       SV Samson   27
## 6   K Upadhyay   noWicket        noWicket   29
## 7     R Sharma   noWicket        noWicket   38
## 8 Yuvraj Singh   noWicket        noWicket   16

13. Wickets taken by bowler

The plots provide the wickets taken by the bowler

teamBowlingWicketMatch(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')

m <- teamBowlingWicketMatch(rr_pw,'Pune Warriors','Rajasthan Royals',plot=FALSE)
m
##         bowler wicketKind wicketPlayerOut runs
## 1   AD Mathews     caught        R Dravid   34
## 2   WD Parnell     bowled       SR Watson   27
## 3      B Kumar     caught       AM Rahane   38
## 4   WD Parnell     caught        BJ Hodge   27
## 5   WD Parnell     caught       SV Samson   27
## 6   K Upadhyay   noWicket        noWicket   29
## 7     R Sharma   noWicket        noWicket   38
## 8 Yuvraj Singh   noWicket        noWicket   16

teamBowlingWicketMatch(sh_rcb,"Royal Challengers Bangalore","Sunrisers Hyderabad")

teamBowlingWicketMatch(dc_csk,"Deccan Chargers", "Chennai Super Kings")

teamBowlingWicketMatch(kxp_dd,"Kings XI Punjab","Delhi Daredevils",plot=TRUE)

14. Bowler Vs Batsmen

The functions compute and display how the different bowlers of a team performed against the batsmen of the opposition.

teamBowlersVsBatsmenMatch(dc_csk,"Deccan Chargers", "Chennai Super Kings")

teamBowlersVsBatsmenMatch(kxp_dd,"Kings XI Punjab","Delhi Daredevils",plot=TRUE)

m <- teamBowlersVsBatsmenMatch(sh_rcb,"Sunrisers Hyderabad","Royal Challengers Bangalore",plot=FALSE)
m
## Source: local data frame [26 x 3]
## Groups: bowler [?]
##
##      bowler        batsman runsConceded
##      (fctr)         (fctr)        (dbl)
## 1   B Kumar       CH Gayle            5
## 2   B Kumar       PA Patel            4
## 3   B Kumar        V Kohli            6
## 4   B Kumar AB de Villiers            6
## 5   B Kumar         S Rana            1
## 6   B Kumar       MA Starc            5
## 7  DW Steyn       CH Gayle            7
## 8  DW Steyn        V Kohli            4
## 9  DW Steyn AB de Villiers            4
## 10 DW Steyn         S Rana            7
## ..      ...            ...          ...

teamBowlersVsBatsmenMatch(rr_pw,'Pune Warriors','Rajasthan Royals')

teamBowlersVsBatsmenMatch(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')

15. Match worm graph

The plots below show the match worm graphs for the IPL Twenty20 matches.

matchWormGraph(dc_csk,"Deccan Chargers", "Chennai Super Kings")

matchWormGraph(kxp_dd,"Kings XI Punjab","Delhi Daredevils")

matchWormGraph(sh_rcb,"Sunrisers Hyderabad","Royal Challengers Bangalore")

matchWormGraph(rr_pw,'Pune Warriors','Rajasthan Royals')

matchWormGraph(kkr_mi,'Kolkata Knight Riders','Mumbai Indians')
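For readers unfamiliar with the term, a worm graph plots each team's cumulative runs against the overs bowled. The sketch below shows the idea in base R. The per-over runs are hypothetical numbers made up for illustration, not taken from any of the matches above, and the plotting code is my own rather than what matchWormGraph does internally.

```r
# Hypothetical runs scored in each of the 20 overs (not real match data)
team1 <- c(7, 4, 9, 6, 11, 3, 8, 5, 10, 6, 7, 12, 4, 9, 8, 6, 13, 10, 7, 9)
team2 <- c(5, 8, 6, 10, 4, 9, 7, 12, 6, 8, 5, 11, 9, 7, 10, 8, 6, 12, 4, 5)

# The "worm" is the running total of runs after each over
worm1 <- cumsum(team1)
worm2 <- cumsum(team2)

plot(seq_along(worm1), worm1, type = "l", col = "blue",
     xlab = "Overs", ylab = "Cumulative runs",
     ylim = c(0, max(worm1, worm2)), main = "Worm graph (hypothetical data)")
lines(seq_along(worm2), worm2, col = "red")
legend("topleft", legend = c("Team 1", "Team 2"),
       col = c("blue", "red"), lty = 1)
```

Where the two lines cross, the team batting second has moved ahead of (or fallen behind) the first team's score at the same stage of the innings.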

Conclusion

This post covered all the functions in the package yorkr that analyse an IPL Twenty20 match between 2 IPL teams. As mentioned above, the yaml match files have already been converted to dataframes and are available for download from Github. Go ahead and give it a try.

To be continued. Watch this space!


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….
