We use Integer Linear Programming (ILP) to try to build the perfect 25-man baseball roster. Below we present our best team, which is the solution of the ILP model we built from the 2015 MLB season player data. If you know baseball, please evaluate the resulting team and leave a comment, so that we learn whether ILP can produce a decent roster. After the table we describe how we arrived at our solution.

Edit: The choice of statistics for our utility index is almost arbitrary. The main goal was to model the general constraints and the objective function. The code makes it easy to add other statistics and to extend the general case to more sophisticated preferences, for example through the weight vector.


Prerequisites

To follow how we set up the ILP model you should be familiar with

  • Linear algebra
  • Linear optimization
  • Integer programming

Data preprocessing

Let's read in the 2015 regular season player level data.

dat = read.csv("Baseball Data.csv")
head(dat[,1:4])
##     Salary              Name      POS Bats
## 1   510000      Joc Pederson       OF    L
## 2   512500      Stephen Vogt       1B    L
## 3  3550000      Wilson Ramos        C    R
## 4 31000000   Clayton Kershaw       SP     
## 5 15000000    Jhonny Peralta       SS    R
## 6  2000000 Carlos Villanueva Reliever

The dataset has 199 rows (players). Some players had NAs in their game statistics, which we replaced with 0. The intent is that a missing statistic neither counts towards nor against a player when we construct the player utility index; we return to the limits of this choice in the closing discussion.

dat[is.na(dat)] = 0

Each baseball player has game statistics associated with them. Below is the list of player level data.

names(dat)
##  [1] "Salary"   "Name"     "POS"      "Bats"     "Throws"   "Team"    
##  [7] "G"        "PA"       "HR"       "R"        "RBI"      "SB"      
## [13] "BB."      "K."       "ISO"      "BABIP"    "AVG"      "OBP"     
## [19] "SLG"      "wOBA"     "wRC."     "BsR"      "Off"      "Def"     
## [25] "WAR"      "playerid"

The statistics are described in the appendix.

Since the game statistics are in different units, we standardize the data by subtracting the mean and dividing by the standard deviation, $x_{changed} = \frac{x-\mu}{s}$. Additionally, we add two new variables, Off.norm and Def.norm, which are the Off and Def ratings rescaled to the interval $[0,1]$ using the formula $x_{changed}=\frac{x-min(x)}{max(x)-min(x)}$. We use the normalized offensive and defensive ratings to quickly evaluate the optimal team found by the ILP.

# select the numeric columns, then drop three of them by position
# (given the 16 columns that remain, presumably Salary, G and playerid)
dat.scaled = scale(dat[,sapply(dat, class) == "numeric"][,c(-1:-2,-19)])

# normalize Off and Def
dat$Off.norm = (dat$Off-min(dat$Off))/(max(dat$Off)-min(dat$Off))
dat$Def.norm = (dat$Def-min(dat$Def))/(max(dat$Def)-min(dat$Def))

head(dat.scaled[,1:4])
##              PA         HR          R        RBI
## [1,]  0.9239111  1.2879067  0.7024833  0.4469482
## [2,]  0.6851676  0.6505590  0.4831027  0.8744364
## [3,]  0.6625837  0.4115537  0.0687172  0.7989973
## [4,] -0.9634531 -0.7834733 -0.9306832 -0.9109555
## [5,]  1.1013556  0.5708906  0.6293565  0.8744364
## [6,] -0.9634531 -0.7834733 -0.9306832 -0.9109555
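
As a small sanity check of the two rescalings, the sketch below applies them to a made-up vector (the values are illustrative and not taken from the dataset). It assumes, as in the formula above, that $s$ is the sample standard deviation, which is what `scale()` uses by default.

```r
# toy statistic for five hypothetical players (illustrative values only)
x = c(2, 4, 4, 6, 9)

# standardization: subtract the mean, divide by the sample standard deviation
x.std = (x - mean(x)) / sd(x)

# scale() performs the same computation column by column
x.scaled = as.vector(scale(x))

# min-max normalization: map the range of x onto [0, 1]
x.norm = (x - min(x)) / (max(x) - min(x))

all.equal(x.std, x.scaled) # the manual formula agrees with scale()
range(x.norm)              # smallest value maps to 0, largest to 1
```

Standardized values can be negative and are unbounded, whereas the normalized values always lie between 0 and 1, which makes them easier to read when comparing a team against the whole pool.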

Now that we have scaled player stats we will weigh them and add them up to obtain the player utility index $U_i$ for player $i$ to use it in the objective function.

$U_i = w_{1}\text{PA}_i+w_{2}\text{HR}_i+w_{3}\text{R}_i+w_{4}\text{RBI}_i+w_{5}\text{SB}_i+w_{6}\text{ISO}_i+w_{7}\text{BABIP}_i+w_{8}\text{AVG}_i+w_{9}\text{OBP}_i+w_{10}\text{SLG}_i+w_{11}\text{wOBA}_i+w_{12}\text{wRC.}_i+w_{13}\text{BsR}_i+w_{14}\text{Off}_i+w_{15}\text{Def}_i+w_{16}\text{WAR}_i$

$\text{ for player } i \text{ where } i \in \{1,\dots,199\}$

By introducing weights we can construct the weight vector which best suits our preferences. For example, if we wanted the player utility index to value the offensive statistics like RBI more than the defensive statistics like Def we would just assign a bigger weight to RBI. We decided to value each statistic equally, i.e. weights are equal.
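
To make the role of the weight vector concrete, here is a minimal sketch on a made-up matrix of standardized statistics. The players, the values and the names `stats`, `w.equal` and `w.pref` are all hypothetical; only the idea of a weighted sum comes from the formula above.

```r
# toy matrix of already standardized stats: 3 hypothetical players x 3 stats
stats = matrix(c( 1.0,  0.5, -0.2,
                 -0.5,  1.2,  0.8,
                  0.3, -1.0,  1.5),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("A", "B", "C"), c("HR", "RBI", "Def")))

# equal weights reproduce the plain row sum
w.equal = rep(1, ncol(stats))
u.equal = as.vector(stats %*% w.equal)

# hypothetical preference: RBI counts twice as much as the other stats
w.pref = c(HR = 1, RBI = 2, Def = 1)
u.pref = as.vector(stats %*% w.pref[colnames(stats)])

u.equal # 1.3 1.5 0.8
u.pref  # 1.8 2.7 -0.2
```

With equal weights player B leads narrowly; doubling the weight on RBI widens B's lead and pushes player C, whose strength is defense, to the bottom.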

Constraint modelling

In baseball each team has a 25-man active roster and a 40-man roster that includes the 25-man active roster. To start a new team we focus on building the perfect 25-man roster. Typically, a 25-man roster consists of five starting pitchers (SP), seven relief pitchers (Reliever), two catchers (C), six infielders (IN), and five outfielders (OF). The current position variable POS has more levels than these five groups, so we map it to a new variable POS2 with the five types SP, Reliever, C, IN, OF.

position = function(x){ # map a detailed position x to its roster group
  x = as.character(x)
  if (x %in% c("1B","2B","3B","SS")) "IN"
  else if (x == "Closer") "Reliever"
  else x # SP, Reliever, C and OF are already group labels
}

dat$POS2 = sapply(dat$POS, position)
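
The same grouping can also be written without a per-element function. The sketch below is an alternative, not the code used for the results in this post; the vector `pos` is made up and only mimics the kind of labels found in the POS column.

```r
# hypothetical position labels, mimicking the values in the POS column
pos = c("1B", "SS", "Closer", "Reliever", "OF", "C", "SP", "3B")

# start from the original labels, then overwrite the ones that get grouped
pos2 = as.character(pos)
pos2[pos2 %in% c("1B", "2B", "3B", "SS")] = "IN"
pos2[pos2 == "Closer"] = "Reliever"

table(pos2) # number of players per roster group
```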

Additionally, we make sure that our 25-man active roster has at least one player at each of the following positions: first base (1B), second base (2B), third base (3B) and shortstop (SS).

Major League Baseball has no salary cap. Instead there is a threshold of \$189 million for the 40-man roster for the period 2014-2016, beyond which a luxury tax applies. For first-time violators the tax is 17.5% of the amount by which they exceed the threshold. We decided to allocate \$178 million to the 25-man roster.

To combine the above constraints with an objective function we use the player utility index $U(x_1,x_2,...,x_n)$, which is a function of the chosen set of $n$ player game statistics, 16 in our case. In our model we maximize the sum of the utility indices of the selected players. The 16 game statistics of interest are

PA, HR, R, RBI, SB, ISO, BABIP, AVG, OBP, SLG, wOBA, wRC., BsR, Off, Def, WAR

Off.norm and Def.norm are not part of the index; we use them only to evaluate the resulting team.

Below is the resulting model.

$$ \begin{align}
\text{max } & \sum^{199}_{i=1} U_i x_i \\
\text{s. t. } & \sum^{199}_{i=1} x_i = 25 \\
& \sum_{i \in \text{SP}} x_i \ge 5 \\
& \sum_{i \in \text{Reliever}} x_i \ge 7 \\
& \sum_{i \in \text{C}} x_i \ge 2 \\
& \sum_{i \in \text{IN}} x_i \ge 6 \\
& \sum_{i \in \text{OF}} x_i \ge 5 \\
& \sum_{i \in P} x_i \ge 1 \text{ for } P \in \{\text{1B, 2B, 3B, SS}\} \\
& \sum_{i \in \text{LeftHandPitchers}} x_i \ge 2 \\
& \sum_{i \in \text{LeftHandBatters}} x_i \ge 2 \\
& \frac{1}{25} \sum^{199}_{i=1} Stat_{ij} x_i \ge mean(Stat_{j}) \text{ for } j = 1,2,\dots,16 \\
& \sum^{199}_{i=1} salary_i x_i \le 178
\end{align} $$

where

  • $U_i$ - utility index for player $i$, $i \in \{1,\dots,199\}$
  • $x_i$ - a binary decision variable which is one if player $i$ is selected and zero otherwise
  • SP, Reliever, C, IN, OF - the sets of players whose group POS2 is starting pitcher, relief pitcher, catcher, infielder and outfielder respectively
  • 1B, 2B, 3B, SS - the sets of players whose position POS is first base, second base, third base and shortstop respectively
  • LeftHandPitchers, LeftHandBatters - the sets of players whose Throws, respectively Bats, value is L
  • $Stat_{ij}$ - game statistic $j$ for player $i$, $j \in \{1,\dots,16\}$
  • $mean(Stat_{j})$ - the average of the statistic $j$ across all players
  • $salary_i$ - salary for player $i$ in millions of dollars, so that the right-hand side 178 is the \$178 million budget

Constraint (2) ensures that we get exactly 25 players. Constraints (3) through (10) ensure that the number of players with certain attributes meets the required minimum. The collection of constraints (11) makes sure that our team's average in each game statistic is at least the average of that statistic across all players. Constraint (12) keeps the payroll of the 25-man roster within our \$178 million budget, and therefore below the luxury tax threshold.
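
Before turning to the full model, the structure of such a program can be seen on a toy instance. The sketch below is not our model: the six players, their utilities, groups and salaries are invented, and the names prefixed with `toy` are hypothetical. It only illustrates how a roster-size equality, group minimums and a budget become rows of a constraint matrix for `lp()` from the lpSolve package.

```r
library("lpSolve")

# toy pool: 6 hypothetical players (utilities, groups and salaries are made up)
toy = data.frame(
  name    = c("P1", "P2", "P3", "P4", "P5", "P6"),
  group   = c("SP", "SP", "OF", "OF", "C",  "C"),
  utility = c(5, 3, 4, 2, 1, 3),
  salary  = c(10, 4, 8, 2, 1, 6), # in millions of dollars
  stringsAsFactors = FALSE
)

# one row per constraint, one column per player
toy.cons = rbind(
  rep(1, nrow(toy)),              # roster size
  as.numeric(toy$group == "SP"),  # at least one starting pitcher
  as.numeric(toy$group == "C"),   # at least one catcher
  toy$salary                      # budget
)
toy.dir = c("=", ">=", ">=", "<=")
toy.rhs = c(3, 1, 1, 15)

# maximize total utility over binary selection variables
toy.model = lp("max", toy$utility, toy.cons, toy.dir, toy.rhs, all.bin = TRUE)
toy$name[toy.model$solution > 0.5] # "P1" "P2" "P5"
```

The two highest-utility players after P1 are P3 and P6, but either of them together with P1 breaks the budget, so the solver settles for P2 and the cheap catcher P5. The full model behaves the same way, only with more rows and 199 columns.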

Below is the solution of this program.

library("lpSolve")

i = nrow(dat) # number of players (decision variables), 199 here

# constraint matrix: one row per constraint, one column per player
cons = rbind(
  rep(1,i), # 25 man constraint (2)
  as.numeric(dat$POS2 == "SP"), # (3)
  as.numeric(dat$POS2 == "Reliever"), # (4)
  as.numeric(dat$POS2 == "C"), # (5)
  as.numeric(dat$POS2 == "IN"), # (6)
  as.numeric(dat$POS2 == "OF"), # (7)
  as.numeric(dat$POS == "1B"), # (8)
  as.numeric(dat$POS == "2B"), # (8)
  as.numeric(dat$POS == "3B"), # (8)
  as.numeric(dat$POS == "SS"), # (8)
  as.numeric(dat$Throws == "L"), # (9)
  as.numeric(dat$Bats == "L"), # (10)
  t(dat[,colnames(dat.scaled)])/25, # (11) outperform the average
  dat$Salary/1000000 # (12) budget constraint, in millions of dollars
)

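# model
f.obj = apply(dat.scaled,1,sum) # utility index: equally weighted sum of scaled stats
f.dir = c("=",rep(">=",27),"<=")
# right-hand sides, in the same order as the rows of cons:
# (2), (3)-(7), (8) for 1B/2B/3B/SS, (9)-(10), (11), (12)
f.rhs = c(25,5,7,2,6,5,rep(1,4),2,2,
          apply(dat[,colnames(dat.scaled)],2,mean),
          178)

model = lp("max", f.obj, cons, f.dir, f.rhs, all.bin=TRUE, compute.sens=1)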
model
## Success: the objective function is 135.6201
sol = model$solution
sol
##   [1] 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
##  [71] 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
## [141] 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Let's look at our ideal baseball team given the constraints outlined above.

# selected players
dat[which(sol>0),c(1:3,6,28:29)]
##       Salary             Name    POS         Team  Def.norm     POS2
## 4   31000000  Clayton Kershaw     SP              0.3713163       SP
## 8    6083333       Mike Trout     OF       Angels 0.4125737       OF
## 24  19750000      David Price     SP              0.3713163       SP
## 26    507500  Dellin Betances Closer      Yankees 0.3713163 Reliever
## 29    509000       Matt Duffy     2B       Giants 0.5972495       IN
## 53  17142857     Max Scherzer     SP              0.3713163       SP
## 54  17277777     Buster Posey      C       Giants 0.5324165        C
## 62    509500     Carson Smith Closer     Mariners 0.3713163 Reliever
## 71    519500     A.J. Pollock     OF Diamondbacks 0.5422397       OF
## 83    535000 Trevor Rosenthal Closer    Cardinals 0.3713163 Reliever
## 87    547100       Cody Allen Closer      Indians 0.3713163 Reliever
## 108  2500000       Dee Gordon     2B      Marlins 0.5402750       IN
## 109  2500000     Bryce Harper     OF    Nationals 0.2043222       OF
## 113  2725000     Lorenzo Cain     OF       Royals 0.6797642       OF
## 115  3083333 Paul Goldschmidt     1B Diamondbacks 0.2357564       IN
## 119  3200000     Zach Britton Closer      Orioles 0.3713163 Reliever
## 121  3630000     Jake Arrieta     SP              0.3713163       SP
## 129  4300000   Josh Donaldson     3B    Blue Jays 0.5815324       IN
## 139  6000000       Chris Sale     SP              0.3713163       SP
## 142  7000000   Russell Martin      C    Blue Jays 0.6110020        C
## 143  7000000       Wade Davis Closer       Royals 0.3713163 Reliever
## 150  8050000  Aroldis Chapman Closer         Reds 0.3713163 Reliever
## 163 10500000  Yoenis Cespedes     OF        - - - 0.5795678       OF
## 176 14000000       Joey Votto     1B         Reds 0.1886051       IN
## 194   543000  Xander Bogaerts     SS      Red Sox 0.5284872       IN

This seems like a decent team: its mean normalized offensive and defensive ratings are 0.4145 and 0.4276 respectively. For comparison, the mean normalized offensive and defensive ratings for all players are 0.3020 and 0.3822 respectively. Our team outperforms the average, and its mean offensive and defensive ratings are better than those of 82.9% and 78.4% of the players respectively.
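
Figures of this kind boil down to two one-liners: the team mean of a rating, and the share of players whose rating falls below that mean. The sketch below shows one plausible way to compute them on made-up numbers; `rating` and `picked` are hypothetical stand-ins for a normalized rating column and the solution vector, and this is not necessarily the exact code behind the numbers above.

```r
# hypothetical normalized ratings for 10 players and a 0/1 selection vector
rating = c(0.10, 0.25, 0.30, 0.35, 0.40, 0.45, 0.55, 0.60, 0.80, 1.00)
picked = c(0,    0,    0,    0,    0,    0,    1,    1,    1,    0)

# mean rating of the selected players
team.mean = mean(rating[picked > 0])

# share of all players whose rating is below the team mean, in percent
pct.below = 100 * mean(rating < team.mean)

team.mean # 0.65
pct.below # 80
```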

While this is a straightforward way to model the selection of players, there are several nuances we need to address. One of them is that the standardized game statistics are not additively independent. As a result, our utility index measures a player's value poorly and is biased. It is possible to construct an unbiased utility index, and baseball analysts have done so many times (look up sabermetrics). Off, Def and many other statistics are examples of such utility indices. A reddit user suggested a solid way to construct the utility index.
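
One concrete instance of this dependence follows directly from the definitions in the appendix. Isolated power is slugging minus batting average, so in raw (unstandardized) units the three statistics satisfy

$$ \begin{align} \text{ISO}_i = \text{SLG}_i - \text{AVG}_i \quad \Longrightarrow \quad \text{AVG}_i + \text{ISO}_i + \text{SLG}_i = 2\,\text{SLG}_i \end{align} $$

An equally weighted sum that includes all three therefore counts slugging twice, and ISO contributes no information of its own. Standardizing each statistic changes the coefficients but not the fact that one of the three is a linear combination of the other two.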

Another issue we need to address is the substitution of missing values with zero. Players with missing game statistics have their utility index diminished because one of the stats used to calculate it is zero. However, imputing with zero is better than imputing with the mean in our case. By imputing with the mean we would introduce new information into the data which may be misleading, e.g. when a player's true game stat is worse or better than the average. As a result, the player utility index would be overestimated or underestimated.
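
The effect of the two imputation choices on a standardized statistic can be seen on a toy vector. The numbers below are invented; the only thing taken from our pipeline is the order of operations, i.e. missing values are filled in first and the statistic is standardized afterwards.

```r
# toy statistic with one missing value (illustrative numbers only)
x = c(10, 20, 30, NA, 40)

# option 1: replace NA with zero, then standardize
x.zero = x
x.zero[is.na(x.zero)] = 0
z.zero = as.vector(scale(x.zero))

# option 2: replace NA with the mean of the observed values, then standardize
x.mean = x
x.mean[is.na(x.mean)] = mean(x, na.rm = TRUE)
z.mean = as.vector(scale(x.mean))

z.zero[4] # negative: the player is treated as the lowest in this statistic
z.mean[4] # exactly 0: the player is treated as average in this statistic
```

Zero imputation pulls the player's index down, while mean imputation silently asserts that the player is average; neither is neutral, which is why the choice deserves to be stated explicitly.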

Finally, I believe that statistical and mathematical methods should only supplement the decision-making process, not replace it, and that this holds in every field, not just baseball.

Appendix

Baseball statistics abbreviations

  • PA - Plate appearance: number of completed batting appearances
  • HR - Home runs: hits on which the batter successfully touched all four bases, without the contribution of a fielding error
  • R - Runs scored: number of times a player crosses home plate
  • RBI - Run batted in: number of runners who score due to a batters' action, except when batter grounded into double play or reached on an error
  • SB - Stolen base: number of bases advanced by the runner while the ball is in the possession of the defense
  • ISO - Isolated power: a hitter's ability to hit for extra bases, calculated by subtracting batting average from slugging percentage
  • BABIP - Batting average on balls in play: frequency at which a batter reaches a base after putting the ball in the field of play. Also a pitching category
  • AVG - Batting average (also abbreviated BA): hits divided by at bats
  • OBP - On-base percentage: times reached base divided by at bats plus walks plus hit by pitch plus sacrifice flies
  • SLG - Slugging average: total bases achieved on hits divided by at-bats
  • wOBA - Some argue that the OPS, on-base plus slugging, formula is flawed and that more weight should be shifted towards OBP (on-base percentage). The statistic wOBA (weighted on-base average) attempts to correct for this.
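  • wRC. - Weighted Runs Created (wRC): an improved version of Bill James' Runs Created statistic, which attempts to quantify a player's total offensive value and measure it in runs. The trailing dot in the column name is what R's read.csv substitutes for a character that is not valid in a name, so the underlying column is probably wRC+.
  • BsR - Base Runs: Another run estimator, like Runs Created; a favorite of writer Tom Tango. Note that FanGraphs uses the same abbreviation for its baserunning runs above average, which may be what this column actually holds.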
  • WAR - Wins above replacement: a non-standard formula to calculate the number of wins a player contributes to his team over a "replacement-level player"
  • Off - total runs above or below average based on offensive contributions (both batting and baserunning)
  • Def - total runs above or below average based on defensive contributions (fielding and position).

Source: Wikipedia::Baseball statistics