Linear optimization and baseball teams
We use integer linear programming (ILP) to build a 25-man baseball roster. Below we present our best team, the solution of an ILP model built on player data from the 2015 MLB season. If you know baseball, please evaluate the resulting team and leave a comment, so that we can judge whether ILP can produce a decent roster. After the table we describe how we arrived at the solution.
Edit: The choice of statistics for our utility index is almost arbitrary. The main goal was to model the general constraints and the objective function. The code makes it easy to add other statistics and to extend the general case to more sophisticated preferences, for example through the weight vector.
Prerequisites
To follow how we set up the ILP model you should be familiar with
 Linear algebra
 Linear optimization
 Integer programming
Data preprocessing
Let's read in the 2015 regular season player level data.
dat = read.csv("Baseball Data.csv")
head(dat[,1:4])
## Salary Name POS Bats
## 1 510000 Joc Pederson OF L
## 2 512500 Stephen Vogt 1B L
## 3 3550000 Wilson Ramos C R
## 4 31000000 Clayton Kershaw SP
## 5 15000000 Jhonny Peralta SS R
## 6 2000000 Carlos Villanueva Reliever
The dataset has 199 rows (players). Some players had missing values (NA) for some game statistics, which we replaced with 0. We chose zero with the intent that a missing value neither adds to nor subtracts from a player's utility index; the discussion at the end of the post revisits this choice.
dat[is.na(dat)] = 0
Each baseball player has game statistics associated with them. Below is the list of player level data.
names(dat)
## [1] "Salary" "Name" "POS" "Bats" "Throws" "Team"
## [7] "G" "PA" "HR" "R" "RBI" "SB"
## [13] "BB." "K." "ISO" "BABIP" "AVG" "OBP"
## [19] "SLG" "wOBA" "wRC." "BsR" "Off" "Def"
## [25] "WAR" "playerid"
The statistics are described in the list below and again in the appendix.

Baseball statistics abbreviations
 PA - Plate appearance: number of completed batting appearances
 HR - Home runs: hits on which the batter successfully touched all four bases, without the contribution of a fielding error
 R - Runs scored: number of times a player crosses home plate
 RBI - Run batted in: number of runners who score due to a batter's action, except when the batter grounded into a double play or reached on an error
 SB - Stolen base: number of bases advanced by the runner while the ball is in the possession of the defense
 ISO - Isolated power: a hitter's ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
 BABIP - Batting average on balls in play: frequency at which a batter reaches a base after putting the ball in the field of play. Also a pitching category
 AVG - Batting average (also abbreviated BA): hits divided by at-bats
 OBP - On-base percentage: times reached base divided by at-bats plus walks plus hit by pitch plus sacrifice flies
 SLG - Slugging average: total bases achieved on hits divided by at-bats
 wOBA - Some argue that the OPS (on-base plus slugging) formula is flawed and that more weight should be shifted towards OBP (on-base percentage). The statistic wOBA (weighted on-base average) attempts to correct for this.
 wRC. - Weighted Runs Created (wRC): an improved version of Bill James' Runs Created statistic, which attempted to quantify a player's total offensive value and measure it by runs.
 BsR - Base Runs: another run estimator, like Runs Created; a favorite of writer Tom Tango
 WAR - Wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a "replacement-level player"
 Off - total runs above or below average based on offensive contributions (both batting and baserunning)
 Def - total runs above or below average based on defensive contributions (fielding and position).
Since the game statistics are in different units, we standardize the data by subtracting the mean and dividing by the standard deviation, $x_{changed} = \frac{x-\mu}{s}$. Additionally, we add two new variables, Off.norm and Def.norm, which are the Off and Def ratings normalized to the interval $[0,1]$ using the formula $x_{changed}=\frac{x-\min(x)}{\max(x)-\min(x)}$. We use the normalized offensive and defensive ratings to quickly evaluate the optimal team according to the ILP.
# select numeric columns and keep the 16 statistics of interest
# (negative indices drop numeric columns 1, 2 and 19)
dat.scaled = scale(dat[,sapply(dat, class) == "numeric"][,c(-1:-2,-19)])
# min-max normalize Off and Def
dat$Off.norm = (dat$Off-min(dat$Off))/(max(dat$Off)-min(dat$Off))
dat$Def.norm = (dat$Def-min(dat$Def))/(max(dat$Def)-min(dat$Def))
head(dat.scaled[,1:4])
## PA HR R RBI
## [1,] 0.9239111 1.2879067 0.7024833 0.4469482
## [2,] 0.6851676 0.6505590 0.4831027 0.8744364
## [3,] 0.6625837 0.4115537 0.0687172 0.7989973
## [4,] 0.9634531 0.7834733 0.9306832 0.9109555
## [5,] 1.1013556 0.5708906 0.6293565 0.8744364
## [6,] 0.9634531 0.7834733 0.9306832 0.9109555
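As a quick check of the two rescalings, here is a minimal sketch on a made-up vector. The values in `hr` are invented for illustration and are not taken from the dataset. Note that `scale()` divides by the sample standard deviation, the same quantity `sd()` returns.

```r
# toy home-run counts; made-up values for illustration only
hr = c(0, 10, 20, 30, 40)

# z-score: subtract the mean, divide by the sample standard deviation
z = (hr - mean(hr)) / sd(hr)

# scale() performs the same computation column by column
z.scale = as.vector(scale(hr))

# min-max normalization maps the values onto [0, 1]
mm = (hr - min(hr)) / (max(hr) - min(hr))

print(round(z, 3))
print(mm)
```

The standardized values are centered on zero, so below-average players get negative scores, while the min-max values always lie between 0 and 1.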
Now that we have scaled the player statistics, we weight them and add them up to obtain the player utility index $U_i$ for player $i$, which we use in the objective function.
$U_i(x) = w_{1}\text{PA}_i+w_{2}\text{HR}_i+w_{3}\text{R}_i+w_{4}\text{RBI}_i+w_{5}\text{SB}_i+w_{6}\text{ISO}_i+w_{7}\text{BABIP}_i+w_{8}\text{AVG}_i+w_{9}\text{OBP}_i+w_{10}\text{SLG}_i+w_{11}\text{wOBA}_i+w_{12}\text{wRC.}_i+w_{13}\text{BsR}_i+w_{14}\text{Off}_i+w_{15}\text{Def}_i+w_{16}\text{WAR}_i$
$\text{ for player } i \text{ where } i \in \{1,\dots,199\}$
By introducing weights we can construct the weight vector that best suits our preferences. For example, if we wanted the player utility index to value offensive statistics like RBI more than defensive statistics like Def, we would simply assign a bigger weight to RBI. We decided to value each statistic equally, i.e. all weights are equal.
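The effect of the weight vector can be sketched on a toy example. The three players, their standardized values and the two weight vectors below are all hypothetical; the point is only that the utility index is a matrix-vector product of the scaled statistics and the weights.

```r
# toy standardized statistics for three hypothetical players
stats = matrix(c( 1.0, -0.5,
                  0.0,  1.5,
                 -1.0, -1.0),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("RBI", "Def")))

w.equal   = c(RBI = 1, Def = 1)   # every statistic counts the same
w.offense = c(RBI = 2, Def = 1)   # RBI counts twice as much as Def

# utility index = weighted sum of the standardized statistics
U.equal   = as.vector(stats %*% w.equal)
U.offense = as.vector(stats %*% w.offense)

print(U.equal)
print(U.offense)
```

With equal weights the index reduces to a plain row sum. Under the offense-heavy weights, player A catches up with player B because A's RBI value now counts double.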
Constraint modelling
In baseball there is a 25-man active roster and a 40-man roster, which includes the 25-man active roster. To start a new team we focus on building the perfect 25-man roster. Typically, a 25-man roster consists of five starting pitchers (SP), seven relief pitchers (Reliever), two catchers (C), six infielders (IN), and five outfielders (OF). The current position variable POS has more levels than these five groups, so we collapse it into a new variable POS2 with the five types SP, Reliever, C, IN, OF.
position = function(x){ # map a detailed position x to its roster group
  x = as.character(x)
  if (x %in% c("1B","2B","3B","SS")) "IN"
  else if (x == "Closer") "Reliever"
  else x
}
dat$POS2 = sapply(dat$POS, position)
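The same grouping can also be written as a vectorized lookup, which avoids calling a function once per player. This is only an alternative sketch: the names `groups` and `to_group` and the sample vector `pos` are our own and do not appear in the analysis.

```r
# lookup from detailed position to roster group;
# positions that are not listed keep their original label
groups = c("1B" = "IN", "2B" = "IN", "3B" = "IN", "SS" = "IN",
           "Closer" = "Reliever")

to_group = function(pos) {
  pos = as.character(pos)
  out = unname(groups[pos])          # NA where the position is not in the lookup
  out[is.na(out)] = pos[is.na(out)]  # fall back to the original label
  out
}

pos  = c("SP", "Closer", "SS", "OF", "C", "1B", "Reliever")  # made-up sample
pos2 = to_group(pos)
print(table(pos2))
```

Tabulating the grouped variable is also a cheap way to confirm that only the five expected groups remain.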
Additionally, we make sure that our 25-man active roster has at least one player at each of the following positions: first base (1B), second base (2B), third base (3B) and shortstop (SS).
There is no salary cap in Major League Baseball, but rather a threshold of \$189 million for the 40-man roster for the period 2014-2016, beyond which a luxury tax applies. For first-time violators the tax is 22.5% of the amount by which they exceed the threshold. We decided to allocate \$178 million to the 25-man roster.
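The tax rule described above amounts to a one-line calculation. The sketch below uses the threshold and the first-time rate quoted in the text; the function name `luxury_tax` is hypothetical, and higher rates for repeat violators are ignored.

```r
# luxury tax owed: rate times the amount above the threshold, zero otherwise
# (threshold and rate as quoted in the text; first-time violators only)
luxury_tax = function(payroll, threshold = 189e6, rate = 0.225) {
  rate * pmax(0, payroll - threshold)
}

print(luxury_tax(178e6))   # our budget stays under the threshold, so no tax
print(luxury_tax(200e6))   # 11 million over the threshold
```

A payroll of 200 million would therefore owe 22.5% of 11 million, while the 178 million budget owes nothing.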
To model the above basic constraints and an objective function we came up with the player utility index $U(x_1,x_2,...,x_n)$, which is a function of the chosen set of $n$ player game statistics, $n = 16$ in our case. In our model we maximize the sum of the player utility indices. The 16 game statistics of interest are
PA, HR, R, RBI, SB, ISO, BABIP, AVG, OBP, SLG, wOBA, wRC., BsR, Off, Def, WAR. The normalized ratings Off.norm and Def.norm are not part of the index; we use them only to evaluate the resulting team.
Below is the resulting model.
$$ \begin{align}
\text{max } & \sum^{199}_{i=1}U_i \cdot x_i \\
\text{s. t. } & \sum^{199}_{i=1}x_i = 25 \\
& \sum x_{\text{SP}} \ge 5 \\
& \sum x_{\text{Reliever}} \ge 7 \\
& \sum x_{\text{C}} \ge 2 \\
& \sum x_{\text{IN}} \ge 6 \\
& \sum x_{\text{OF}} \ge 5 \\
& \sum x_{\text{POS}} \ge 1 \text{ for } POS \in \{\text{1B,2B,3B,SS}\}\\
& \sum x_{\text{LeftHandPitchers}} \ge 2 \\
& \sum x_{\text{LeftHandBatters}} \ge 2 \\
& \frac{1}{25} \sum^{199}_{i=1} Stat_{ij}x_{i} \ge mean(Stat_{j}) \text{ for } j = 1,2,...,16 \\
& \sum^{199}_{i=1}salary_i \cdot x_i \le 178 \times 10^{6}
\end{align} $$
where
 $U_i$ - utility index for player $i$, $i \in \{1,\dots,199\}$
 $x_i$ - a binary variable which is one if player $i$ is selected and zero otherwise
 $x_{\text{SP}}, x_{\text{Reliever}}$, etc. - the variables $x_i$ of those players who have the specified attribute, such as starting pitcher (SP), left-handed pitcher, etc.
 $x_{\text{POS}}$ - the variables $x_i$ of those players who play the position $POS$, $POS \in \{\text{1B,2B,3B,SS}\}$
 $Stat_{ij}$ - game statistic $j$ for player $i$, $j \in \{1,\dots,16\}$
 $mean(Stat_{j})$ - the average of statistic $j$ across all players
 $salary_i$ - salary of player $i$ in dollars
Constraint (2) ensures that we get exactly 25 players. Constraints (3) through (10) ensure that the number of players with certain attributes meets the required minimum. The collection of constraints (11) makes sure that our team's average game statistics match or exceed the averages across all players. Constraint (12) ensures that we stay within our budget, which is set below the luxury tax threshold.
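To make the matrix form of such a model concrete, here is a minimal sketch on a made-up pool of six players. It does not use a solver: it simply enumerates every three-player roster in base R, which only works for tiny problems. All numbers and the helper `feasible` are invented for illustration.

```r
# toy data: six hypothetical players
u       = c(5, 4, 3, 6, 2, 1)      # utility index
pitcher = c(1, 1, 0, 0, 0, 0)      # 1 if the player is a pitcher
salary  = c(10, 4, 3, 12, 2, 1)    # in millions

# constraints in matrix form: one row per constraint, one column per player
A   = rbind(rep(1, 6), pitcher, salary)
dir = c("==", ">=", "<=")
b   = c(3, 1, 20)                  # roster size, min pitchers, budget

# TRUE if the 0/1 vector x satisfies every row of A x (dir) b
feasible = function(x) {
  lhs = as.vector(A %*% x)
  all(mapply(function(l, d, r) do.call(d, list(l, r)), lhs, dir, b))
}

# enumerate every 3-player roster and keep the best feasible one
best = NULL; best.val = -Inf
for (idx in combn(6, 3, simplify = FALSE)) {
  x = as.numeric(seq_len(6) %in% idx)
  if (feasible(x) && sum(u * x) > best.val) {
    best = x; best.val = sum(u * x)
  }
}
print(best)
print(best.val)
```

The two highest-utility players cannot both be selected because together they leave too little budget for a third player, so the best feasible roster is not simply the top three by utility. This kind of interaction between constraints is what the ILP solver handles at full scale.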
Below we solve the full 199-player program with the lpSolve package.
library("lpSolve")
i = 199 # number of players (variables)
# constraints
cons = rbind(
rep(1,i), # 25 man constraint (2)
sapply(dat$POS2, function(x) if (x == "SP") x=1 else x=0), # (3)
sapply(dat$POS2, function(x) if (x == "Reliever") x=1 else x=0), # (4)
sapply(dat$POS2, function(x) if (x == "C") x=1 else x=0), # (5)
sapply(dat$POS2, function(x) if (x == "IN") x=1 else x=0), # (6)
sapply(dat$POS2, function(x) if (x == "OF") x=1 else x=0), # (7)
sapply(dat$POS, function(x) if (x == "1B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "2B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "3B") x=1 else x=0), # (8)
sapply(dat$POS, function(x) if (x == "SS") x=1 else x=0), # (8)
sapply(dat$Throws, function(x) if (x == "L") x=1 else x=0), # (9)
sapply(dat$Bats, function(x) if (x == "L") x=1 else x=0), # (10)
t(dat[,colnames(dat.scaled)])/25, # (11) outperform the average
dat$Salary/1000000 # (12) budget constraint
)
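# model
f.obj = apply(dat.scaled,1,sum) # utility index: equal-weight sum of the scaled stats
f.dir = c("=",rep(">=",27),"<=")
# right-hand sides, in the same row order as cons:
# (2), (3)-(7), four rows of (8), then (9) and (10)
f.rhs = c(25,5,7,2,6,5,rep(1,4),2,2,
apply(dat[,colnames(dat.scaled)],2,mean),
178)
model = lp("max", f.obj, cons, f.dir, f.rhs, all.bin=TRUE, compute.sens=1)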
model
## Success: the objective function is 135.6201
sol = model$solution
sol
## [1] 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## [71] 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Let's look at our ideal baseball team given the constraints outlined above.
# selected players
dat[which(sol>0),c(1:3,6,28:29)]
## Salary Name POS Team Def.norm POS2
## 4 31000000 Clayton Kershaw SP 0.3713163 SP
## 8 6083333 Mike Trout OF Angels 0.4125737 OF
## 24 19750000 David Price SP 0.3713163 SP
## 26 507500 Dellin Betances Closer Yankees 0.3713163 Reliever
## 29 509000 Matt Duffy 2B Giants 0.5972495 IN
## 53 17142857 Max Scherzer SP 0.3713163 SP
## 54 17277777 Buster Posey C Giants 0.5324165 C
## 62 509500 Carson Smith Closer Mariners 0.3713163 Reliever
## 71 519500 A.J. Pollock OF Diamondbacks 0.5422397 OF
## 83 535000 Trevor Rosenthal Closer Cardinals 0.3713163 Reliever
## 87 547100 Cody Allen Closer Indians 0.3713163 Reliever
## 108 2500000 Dee Gordon 2B Marlins 0.5402750 IN
## 109 2500000 Bryce Harper OF Nationals 0.2043222 OF
## 113 2725000 Lorenzo Cain OF Royals 0.6797642 OF
## 115 3083333 Paul Goldschmidt 1B Diamondbacks 0.2357564 IN
## 119 3200000 Zach Britton Closer Orioles 0.3713163 Reliever
## 121 3630000 Jake Arrieta SP 0.3713163 SP
## 129 4300000 Josh Donaldson 3B Blue Jays 0.5815324 IN
## 139 6000000 Chris Sale SP 0.3713163 SP
## 142 7000000 Russell Martin C Blue Jays 0.6110020 C
## 143 7000000 Wade Davis Closer Royals 0.3713163 Reliever
## 150 8050000 Aroldis Chapman Closer Reds 0.3713163 Reliever
## 163 10500000 Yoenis Cespedes OF    0.5795678 OF
## 176 14000000 Joey Votto 1B Reds 0.1886051 IN
## 194 543000 Xander Bogaerts SS Red Sox 0.5284872 IN
This seems like a decent team, with mean normalized offensive and defensive ratings of 0.414495 and 0.4275835 respectively. For comparison, the mean normalized offensive and defensive ratings across all players are 0.3019702 and 0.3821564. Our team outperforms the average: its mean offensive and defensive ratings are higher than those of 82.9% and 78.4% of the players in the dataset, respectively.
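One plausible way to obtain such a comparison is to take the share of players whose rating falls below the roster mean. The sketch below does this on made-up ratings; the vector `off.norm` and the roster indices in `team` are hypothetical.

```r
# made-up normalized ratings for ten hypothetical players
off.norm = c(0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55, 0.70, 1.00)
team     = c(8, 9, 10)             # indices of the selected players

team.mean = mean(off.norm[team])   # mean rating of the roster
all.mean  = mean(off.norm)         # mean rating of the whole pool

# share of players in the pool whose rating is below the roster mean
pct.below = 100 * mean(off.norm < team.mean)

print(c(team.mean, all.mean, pct.below))
```

With the real data, `team` would be `which(sol > 0)` and the rating vector would be `dat$Off.norm` or `dat$Def.norm`.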
While this is a straightforward way to model the selection of players, there are several nuances we need to address. One of them is that the standardized game statistics are not additively independent: many of them overlap (for example, ISO is computed from SLG and AVG), so a plain sum counts the same contribution more than once. As a result, our utility index is biased and measures a player's value poorly. It is possible to construct an unbiased utility index, and a lot of work in baseball has gone into this (look up sabermetrics). Off, Def and many other statistics are themselves examples of utility indices. A reddit user suggested a solid way to construct the utility index.
Another issue we need to address is our substitution of zero for missing values. Players with missing game statistics have their utility index diminished because one of the statistics used to calculate it is zero. However, in our case imputing with zero is better than imputing with the mean. By imputing with the mean we would introduce new information into the data which may be misleading, e.g. when a player's true game statistic is worse or better than the average. As a result, the player utility index would be overestimated or underestimated.
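The difference between the two imputation choices shows up once the statistic is standardized. In the following sketch the stolen-base counts are invented; the last player has a missing value.

```r
# made-up stolen-base counts; the last player has a missing value
sb = c(2, 5, 8, 11, 14, NA)

sb.zero = ifelse(is.na(sb), 0, sb)                       # impute with zero
sb.mean = ifelse(is.na(sb), mean(sb, na.rm = TRUE), sb)  # impute with the mean

# standardize both versions
z.zero = as.vector(scale(sb.zero))
z.mean = as.vector(scale(sb.mean))

print(round(z.zero[6], 3))   # standardized score of the imputed player
print(round(z.mean[6], 3))
```

After zero imputation the player receives a negative standardized score, which lowers the utility index. After mean imputation the score is exactly zero, as if the player were known to be average.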
Finally, I believe that statistical and mathematical methods are acceptable only as a supplement to the decision-making process, not just in baseball but in every field.
Appendix
Baseball statistics abbreviations
 PA - Plate appearance: number of completed batting appearances
 HR - Home runs: hits on which the batter successfully touched all four bases, without the contribution of a fielding error
 R - Runs scored: number of times a player crosses home plate
 RBI - Run batted in: number of runners who score due to a batter's action, except when the batter grounded into a double play or reached on an error
 SB - Stolen base: number of bases advanced by the runner while the ball is in the possession of the defense
 ISO - Isolated power: a hitter's ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
 BABIP - Batting average on balls in play: frequency at which a batter reaches a base after putting the ball in the field of play. Also a pitching category
 AVG - Batting average (also abbreviated BA): hits divided by at-bats
 OBP - On-base percentage: times reached base divided by at-bats plus walks plus hit by pitch plus sacrifice flies
 SLG - Slugging average: total bases achieved on hits divided by at-bats
 wOBA - Some argue that the OPS (on-base plus slugging) formula is flawed and that more weight should be shifted towards OBP (on-base percentage). The statistic wOBA (weighted on-base average) attempts to correct for this.
 wRC. - Weighted Runs Created (wRC): an improved version of Bill James' Runs Created statistic, which attempted to quantify a player's total offensive value and measure it by runs.
 BsR - Base Runs: another run estimator, like Runs Created; a favorite of writer Tom Tango
 WAR - Wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a "replacement-level player"
 Off - total runs above or below average based on offensive contributions (both batting and baserunning)
 Def - total runs above or below average based on defensive contributions (fielding and position).
Source: Wikipedia::Baseball statistics