── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyLP)
Introduction
Yesterday on Mastodon I saw this post about the tidyLPpackage. I was thinking “I have a spiffy new blog (with the same un-spiffy old posts) and I haven’t written anything in 2025, so maybe I should write about my experiences with the package before the year is out”.
A back-story: I moved to the UK to teach for a bit (this was some years ago) and learned that Operations Research (linear programming, integer programming, transportation and assignment problems, etc.) fell under the remit of Statistics, even though there is no actual randomness in any of the above, so I was expected to teach that stuff even though I had not used it in a long time, or ever. So I had some late nights learning or re-learning material to teach to my students in the next week. The book we used was the massive tome by Winston (this edition). The link doesn’t say how many pages, but “weight: 2.45 kg” gives you the idea. Why we were using a massively wordy American text rather than a much more concise British one, I have no idea.
A first example
One of the first linear programming problems in Winston goes like1 this:
A toy company makes wooden toy soldiers and toy trains. Each soldier requires 2 hours of finishing time and one hour of carpentry time, and results in a profit of $3. Each train requires 1 hour of finishing time and 1 hour of carpentry time, and results in a profit of $2. There are 100 finishing hours and 80 carpentry hours available, and previous experience indicates that a maximum of 40 soldiers can be sold. How many soldiers and trains should be made in order to maximize profit?
There are two things that make this a linear programming problem:
the function to be optimized (here profit) is linear
there are some (also linear) constraints on what we can do.
The standard by-hand method of solving these problems is called the Simplex Method. The idea is that the constraints define a region in space called a simplex, and the highest profit is at one of the corners of the the simplex. The Simplex Method tells you how to move from corner to corner while increasing the profit (and thus, when there is nowhere else to move to, you have found the optimal solution).
Unless you are learning the method specifically, these days there is no need to learn the Simplex Method to solve these problems, and the tidyLP package provides a function lp_solve to find the solution for you.
In this problem, the logical flow is:
the number of soldiers and trains you make determine how much profit you make.
the number of soldiers and trains made also determine how many finishing and carpentry hours you need
there are limits on all of those things, which you need to respect.
The way you set this up in tidyLP is to begin with a dataframe which captures the relationships in the first of those two bullet points. Each row of that dataframe contains something you want to optimize over, and each column contains the contribution of each additional unit of those things to your profit or to your finishing and carpentry hours, like this:2
The inputs to tidy_lp are our dataframe, the column of it to optimize (by default maximize), and then one for each constraint. For constraints expressed in terms of the columns of our dataframe, you put the column name on the left of a squiggle, and on the right you put for example leq (“less than or equal to”) with a number inside saying what it is less than or equal to. For constraints expressed in terms of what we are optimizing over (the soldiers constraint here), the notation is different: the above shows how we express “the number of soldiers is less than 40”. I put soldiers and trains in a column called product.
For constraints, there are also geq() and eq(), which are “greater or equal” and “exactly equal” respectively.
The last step shows that the optimal solution is to make 20 soldiers and 60 trains. This happens to be the same answer as Winston got, so it is presumably correct.
We can go further, and investigate how many of our finishing and carpentry hours we ended up using. For this, we note that what I called sol has all the columns in our original dataframe in it, so we can do calculations like this:
sol %>%mutate(across(finishing:profit, \(x) x * .solution))
sol %>%summarize(across(finishing:profit, \(x) sum(x * .solution)))
This shows that we made a profit of $60 from soldiers and $120 from trains for a total of $180. Also, we used 40 finishing hours for soldiers and 60 for trains, so we used up our total 100 finishing hours. For carpentry hours, we used 20 for making soldiers and 60 for making trains, so we also used up our total of 80 of these. We only made 20 soldiers, though, not reaching our limit of 40: this constraint is not “binding”.
None of this, of course, is actually a proof that the answer is best (other than our confidence in lp_solve): it respects all the constraints, but in principle there could be a better number of soldiers and trains to make that also respects the constraints.
A second example
This one comes directly from Winston (Problem 3 in Section 3.2 in my edition):
Leary Chemical manufactures three chemicals: A, B, and C. These chemicals are produced via two production processes: 1 and 2. Running process 1 for an hour costs $4 and yields 3 units of A, 1 of B, and 1 of C. Running process 2 for an hour costs $1 and produces 1 unit of A and 1 of B. To meet customer demands, at least 10 units of A, 5 of B, and 3 of C must be produced daily. Graphically determine a daily production plan that minimizes the cost of meeting Leary Chemical’s daily demands.
Note that in this problem, there is no benefit to producing extra of the chemicals beyond what is needed to meet customer demands: here, we are not selling it at a profit, but producing it at a cost.
The logical flow is that the number of hours processes 1 and 2 are run (to be optimized over) determines the amounts of chemicals A, B, and C that are produced, so these are the relationships that need to be summarized in our dataframe. I have called the processes P1 and P2 (so that I don’t get confused):
Process 2 does not produce any of chemical C, hence the zero in the dataframe.
The constraints (meeting customer demand) are all in terms of columns of this dataframe (there are, for example, no limits on how many hours we can run each process for), so there is no difficulty in expressing the constraints:
tidy_lp( d, cost, A ~geq(10), B ~geq(5), C ~geq(3),.direction ="min") %>%lp_solve() %>%bind_solution() -> solsol %>%select(process, .solution)
The one new thing is that this is a minimum-cost problem. This is specified by using the additional input .direction.
The solution is to run process 1 for 3 hours and process 2 for 2 hours. How much of each chemical is produced under this plan?
sol %>%mutate(across(A:cost, \(x) x * .solution))
sol %>%summarize(across(A:cost, \(x) sum(x * .solution)))
There is no extra of chemicals B and C produced, but this plan produces one extra unit of chemical A. This seems to be unavoidable: to produce enough of C, process 1 has to be run for 3 hours, and to produce enough of B, process 2 has to be run in addition for 2 hours. Doing that produces \(9+2 = 11\) units of A.
Being a statistician
As I said earlier, this is a mathematical problem and not a statistical one. Suppose we had no knowledge of the simplex method or the tidyLP package. What would we do? One way is to generate some numbers of hours to run each process, work out the cost of doing so and the amounts of the three chemicals produced. This part is a bit complicated, so we’ll put it in a function, imaginatively called f:
f <-function(d, p1, p2) { v <-c(p1, p2) d %>%summarize(across(A:cost, \(x) sum(x * v)))}
This returns a one-row dataframe with the total cost and the amounts of A, B, and C produced:
f(d, 3, 2)
reproducing our answer from earlier.
All right, let’s generate some literally random3 number of hours to run each process (let’s say between 0 and 5):
Now, we can run our function for each of those. My function is not vectorized,4 so I use rowwise for this:
g %>%rowwise() %>%mutate(ans =list(f(d, p1, p2)))
and now I unnest ans wider to see the actual values:
g %>%rowwise() %>%mutate(ans =list(f(d, p1, p2))) %>%unnest_wider(ans)
I cannot just minimize the cost here; I need to make sure I produced enough of each chemical:
g %>%rowwise() %>%mutate(ans =list(f(d, p1, p2))) %>%unnest_wider(ans) %>%mutate(constr_ok = (A >=10& B >=5& C >=3))
Only two of these rows satisfy all the constraints, so I filter those and find the one with the smallest cost:
g %>%rowwise() %>%mutate(ans =list(f(d, p1, p2))) %>%unnest_wider(ans) %>%mutate(constr_ok = (A >=10& B >=5& C >=3)) %>%filter(constr_ok) %>%slice_min(cost)
Using this amazingly mindless method, our best cost is 17.9 (compared to the actual optimal 14), running process 1 for 4.10 hours and process 2 for 1.49 hours. But at least it satisfies the constraints.
The thing about simulations is that you can get an answer as close as you like by running it long enough. Let’s try 1000 simulations this time. The code is all there; this time we redefine g:
g %>%rowwise() %>%mutate(ans =list(f(d, p1, p2))) %>%unnest_wider(ans) %>%mutate(constr_ok = (A >=10& B >=5& C >=3)) %>%filter(constr_ok) %>%slice_min(cost)
and this time we have gotten much closer to the answer that tidyLP gave us: a cost of 14.4 from running process 1 for 3.03 hours and process 2 for 2.30 hours, and we can check that at least enough of each chemical was produced.
Simulation for the win. Or something like that.
Integer programming
The tidyLP package can also handle “integer programming”, where the answers must be integers or a binary “yes/no”. In the package README, the author gives an actual practical non-toy example of selecting a fantasy basketball team, given limits on the total salary and requirements for players playing each position and the number that can be from the same NBA team. There are over 300 players to select from, and each player is either in or not in the fantasy team (so it is a binary problem).
Here is a rather artificial integer problem, based on problem 15 in section 9.2 of Winston:5
Fruit Computer makes two types of computer. A Pear computer requires 1 hour of labour and 2 chips, and sells for $400; an Apricot computer requires 2 hours of labour and 5 chips. There are 23 chips and 15 hours of labour available. The company should make at least as many Apricot computers as Pear ones. How many computers of each type should Fruit Computer manufacture to maximize revenue?
The setup for this is the same as for a regular linear programming problem:
This time, we have a constraint in terms of computers (not just the columns of the dataframe d). The right side of a constraint has to be a number, so we write this as “the number of Apricots made minus the number of Pears made must be greater or equal to zero”. With that in mind, and solving as before, we get:
The best solution is to make 1 Pear computer and 4 Apricot. Note that this solution is not nearly the same as the decimal solution to this problem rounded off. Making 3 Pears and 3 Apricots uses 9 hours of labour and 21 chips, so it satisfies the constraints, with a revenue of $3900, but…
sol %>%mutate(across(labour:selling, \(x) x * .solution))
sol %>%summarize(across(labour:selling, \(x) sum(x * .solution)))
The revenue is $4000, more than from making 3 computers of each type. This solution uses 22 of the 23 chips and only 9 of the 15 available hours of labour. Requiring the answers to be integers has limited us a good bit.
Footnotes
I have simplified some details, but the problem is the same.↩︎
The obvious way to set this up is with a tribble, but I always find that I get fed up with typing quotes with these, so the way I use here takes advantage of being able to read in a piece of text as if it were a CSV file.↩︎