## Efficient algorithms for solving p-median problems with radius constraints; and, Its application to clustering with feature selection

##### Date

31/07/2021##### Author

Martín del Campo Barraza, Minerva Montserrat

##### Metadata

##### Abstract

Facility location problems aim to identify where to locate facilities to satisfy the demand of a
given set of customers. A location for a facility is selected by minimising the total supply cost
of the demand to the customers. The p-median problem is one of the fundamental problems in
discrete location, where a fixed number of facilities must be located in order to minimise the
total distances from every customer to the facility the customer is allocated to.
The most recent exact method to solve the p-median problem is a radius formulation, where the
problem is defined as a set covering problem with cumulative variables. Under this definition,
the problem is reframed as a partial formulation where more inequalities are added as needed
via a row generation technique. This strategy is embedded in a branch-and-bound algorithm
and it is able to solve very large instances with several thousands of nodes. However, the performance of the strategy for small values of p is still unsatisfactory. Since in some applications
of the p-median problem, as in the location of distribution centres, the p-median problem is
solved for small values of p, the development of an exact algorithm that allows us to solve large
instances for small values of p is still needed.
In this thesis, I develop a branch-and-bound algorithm combined with Lagrangian relaxation
method. The radius constraint is relaxed to form the Lagrangian dual problem. This relaxation
is then used in a branch-and-bound algorithm to find lower bounds at every node. I perform
a novel branching procedure called constraint branching, where branching is done by selecting
violated constraints rather than the common variable fixing branching. The algorithm also
exploits the structure of the cumulative variables and covering inequalities to accelerate the
pruning and bounding of the tree. I have shown that exact solutions can be obtained with this
new branching approach via a computational study to test the efficiency of this algorithm.
The p-median problem has been applied successfully to cluster analysis, with the problem
of clustering with feature selection as one of these applications. The goal of the problem is to
partition a collection of objects into groups called clusters in such a way that objects within each
cluster are similar to each other given their shared features. The complexity of this problem
increases when a large number of features is available because only a few of them are necessary
to discover the true structure of a cluster. Therefore, a solution to this problem is required to
incorporate a selection procedure where the most relevant features for the cluster representation
are chosen.
This problem can be modelled mathematically using a formulation based on the p-median
problem. Since two simultaneous decisions are made (clustering and feature selection), the
model can be seen as a double p-median problem formulation. Recently, a mixed-integer linear program was introduced to perform clustering and feature selection at the same time. I
present an alternative formulation based on the radius formulation idea and introduce the use
of what I call pattern variables, rather than the usual cumulative decision variables, allowing
for a much sparser structure. A computational study is carried out for binary and real-valued
datasets to show that this model is more efficient, particularly when dealing with larger datasets.
Lastly, I present an alternative formulation for the p-median problem based on the radius
formulation with pattern variables. The use of pattern variables has the advantage that it
creates a partition structure on the constraints, rather than the nested structure created by
the cumulative variables used in the radius formulation. A computational study is performed
to compare the performance of the proposed formulation with pattern variables against the
radius formulation with cumulative variables. The former is shown to be superior when using a direct implementation in CPLEX, and outperforms the latter when implemented in a
branch-and-bound algorithm with problems for small values of p