Skip to content

Bayesian Multinomial Logistic Regression

Multinomial logistic regression is an extension of logistic regression. Logistic regression is used to model problems in which there are exactly two possible discrete outcomes. Multinomial logistic regression is used to model problems in which there are two or more possible discrete outcomes.

In our example, we’ll be using the iris dataset. The goal of the iris multiclass problem is to predict the species of a flower given measurements (in centimeters) of sepal length and width and petal length and width. There are three possible species: Iris setosa, Iris versicolor, and Iris virginica.

To start, let’s import all the libraries we’ll need.

# Load Turing.
using Turing

# Load RDatasets.
using RDatasets

# Load StatsPlots for visualizations and diagnostics.
using StatsPlots

# Functionality for splitting and normalizing the data.
using MLDataUtils: shuffleobs, splitobs, rescale!

# We need a softmax function which is provided by NNlib.
using NNlib: softmax

# Set a seed for reproducibility.
using Random
Random.seed!(0)

# Hide the progress prompt while sampling.
Turing.turnprogress(false);
┌ Info: Precompiling Turing [fce5fe82-541a-59a6-adf8-730c64b5f9a0]
└ @ Base loading.jl:1260
┌ Info: Precompiling RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b]
└ @ Base loading.jl:1260
┌ Info: Precompiling StatsPlots [f3b207a7-027a-5e70-b257-86293d7955fd]
└ @ Base loading.jl:1260
┌ Info: Precompiling MLDataUtils [cc2ba9b6-d476-5e6d-8eaf-a92d5412d41d]
└ @ Base loading.jl:1260
┌ Info: [Turing]: progress logging is disabled globally
└ @ Turing /home/cameron/.julia/packages/Turing/3goIa/src/Turing.jl:23
┌ Info: [AdvancedVI]: global PROGRESS is set as false
└ @ AdvancedVI /home/cameron/.julia/packages/AdvancedVI/PaSeO/src/AdvancedVI.jl:15

Data Cleaning & Set Up

Now we’re going to import our dataset. Twenty rows of the dataset are shown below so you can get a good feel for what kind of data we have.

# Import the "iris" dataset.
data = RDatasets.dataset("datasets", "iris");

# Show twenty random rows.
data[rand(1:size(data, 1), 20), :]

20 rows × 5 columns

SepalLengthSepalWidthPetalLengthPetalWidthSpecies
Float64Float64Float64Float64Cat…
15.62.93.61.3versicolor
25.82.73.91.2versicolor
35.52.34.01.3versicolor
46.73.35.72.5virginica
55.13.51.40.2setosa
65.13.81.50.3setosa
74.83.41.90.2setosa
86.02.94.51.5versicolor
96.93.15.42.1virginica
105.43.91.70.4setosa
115.03.61.40.2setosa
125.73.04.21.2versicolor
135.03.31.40.2setosa
147.73.06.12.3virginica
155.82.85.12.4virginica
164.43.01.30.2setosa
176.33.34.71.6versicolor
186.02.75.11.6versicolor
194.63.41.40.3setosa
206.02.24.01.0versicolor

In this data set, the outcome Species is currently coded as a string. We convert it to a numerical value by using indices 1, 2, and 3 to indicate species setosa, versicolor, and virginica, respectively.

# Recode the `Species` column.
species = ["setosa", "versicolor", "virginica"]
data[!, :Species_index] = indexin(data[!, :Species], species)

# Show twenty random rows of the new species columns
data[rand(1:size(data, 1), 20), [:Species, :Species_index]]

20 rows × 2 columns

SpeciesSpecies_index
Cat…Union…
1versicolor2
2setosa1
3setosa1
4versicolor2
5setosa1
6virginica3
7versicolor2
8versicolor2
9setosa1
10virginica3
11setosa1
12setosa1
13versicolor2
14versicolor2
15virginica3
16versicolor2
17setosa1
18setosa1
19virginica3
20setosa1

After we’ve done that tidying, it’s time to split our dataset into training and testing sets, and separate the features and target from the data. Additionally, we must rescale our feature variables so that they are centered around zero by subtracting each column by the mean and dividing it by the standard deviation. Without this step, Turing’s sampler will have a hard time finding a place to start searching for parameter estimates.

# Split our dataset 50%/50% into training/test sets.
trainset, testset = splitobs(shuffleobs(data), 0.5)

# Define features and target.
features = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]
target = :Species_index

# Turing requires data in matrix and vector form.
train_features = Matrix(trainset[!, features])
test_features = Matrix(testset[!, features])
train_target = trainset[!, target]
test_target = testset[!, target]

# Standardize the features.
μ, σ = rescale!(train_features; obsdim = 1)
rescale!(test_features, μ, σ; obsdim = 1);

Model Declaration

Finally, we can define our model logistic_regression. It is a function that takes three arguments where

  • x is our set of independent variables;
  • y is the element we want to predict;
  • σ is the standard deviation we want to assume for our priors.

We select the setosa species as the baseline class (the choice does not matter). Then we create the intercepts and vectors of coefficients for the other classes against that baseline. More concretely, we create scalar intercepts intercept_versicolor and intersept_virginica and coefficient vectors coefficients_versicolor and coefficients_virginica with four coefficients each for the features SepalLength, SepalWidth, PetalLength and PetalWidth. We assume a normal distribution with mean zero and standard deviation σ as prior for each scalar parameter. We want to find the posterior distribution of these, in total ten, parameters to be able to predict the species for any given set of features.

# Bayesian multinomial logistic regression
@model function logistic_regression(x, y, σ)
    n = size(x, 1)
    length(y) == n || throw(DimensionMismatch("number of observations in `x` and `y` is not equal"))

    # Priors of intercepts and coefficients.
    intercept_versicolor ~ Normal(0, σ)
    intercept_virginica ~ Normal(0, σ)
    coefficients_versicolor ~ MvNormal(4, σ)
    coefficients_virginica ~ MvNormal(4, σ)

    # Compute the likelihood of the observations.
    values_versicolor = intercept_versicolor .+ x * coefficients_versicolor
    values_virginica = intercept_virginica .+ x * coefficients_virginica
    for i in 1:n
        # the 0 corresponds to the base category `setosa`
        v = softmax([0, values_versicolor[i], values_virginica[i]])
        y[i] ~ Categorical(v)
    end
end;

Sampling

Now we can run our sampler. This time we’ll use HMC to sample from our posterior.

chain = sample(logistic_regression(train_features, train_target, 1), HMC(0.05, 10), MCMCThreads(), 1500, 3)
Chains MCMC chain (1500×19×3 Array{Float64,3}):

Iterations        = 1:1500
Thinning interval = 1
Chains            = 1, 2, 3
Samples per chain = 1500
parameters        = coefficients_versicolor[1], coefficients_versicolor[2], coefficients_versicolor[3], coefficients_versicolor[4], coefficients_virginica[1], coefficients_virginica[2], coefficients_virginica[3], coefficients_virginica[4], intercept_versicolor, intercept_virginica
internals         = acceptance_rate, hamiltonian_energy, hamiltonian_energy_error, is_accept, log_density, lp, n_steps, nom_step_size, step_size

Summary Statistics
                  parameters      mean       std   naive_se      mcse        ess      rhat  
                      Symbol   Float64   Float64    Float64   Float64    Float64   Float64  
                                                                                            
  coefficients_versicolor[1]    1.5404    0.6753     0.0101    0.0335   332.4769    1.0017  
  coefficients_versicolor[2]   -1.4298    0.5098     0.0076    0.0171   786.5622    1.0015  
  coefficients_versicolor[3]    1.1382    0.7772     0.0116    0.0398   328.8508    1.0091  
  coefficients_versicolor[4]    0.0693    0.7300     0.0109    0.0374   368.3007    1.0048  
   coefficients_virginica[1]    0.4251    0.6983     0.0104    0.0294   381.6545    1.0017  
   coefficients_virginica[2]   -0.6744    0.6036     0.0090    0.0250   654.1030    1.0012  
   coefficients_virginica[3]    2.0076    0.8424     0.0126    0.0390   344.6077    1.0067  
   coefficients_virginica[4]    2.6704    0.7982     0.0119    0.0423   337.9600    1.0043  
        intercept_versicolor    0.8408    0.5257     0.0078    0.0167   874.4821    1.0044  
         intercept_virginica   -0.7351    0.6639     0.0099    0.0285   525.8135    1.0039  

Quantiles
                  parameters      2.5%     25.0%     50.0%     75.0%     97.5%  
                      Symbol   Float64   Float64   Float64   Float64   Float64  
                                                                                
  coefficients_versicolor[1]    0.2659    1.0755    1.5231    1.9860    2.9059  
  coefficients_versicolor[2]   -2.4714   -1.7610   -1.4109   -1.0749   -0.4921  
  coefficients_versicolor[3]   -0.4377    0.6358    1.1456    1.6500    2.6215  
  coefficients_versicolor[4]   -1.3741   -0.4381    0.0652    0.5711    1.4808  
   coefficients_virginica[1]   -0.9452   -0.0487    0.4287    0.8991    1.7973  
   coefficients_virginica[2]   -1.8717   -1.0756   -0.6641   -0.2501    0.4867  
   coefficients_virginica[3]    0.3740    1.4180    1.9941    2.5862    3.6788  
   coefficients_virginica[4]    1.1985    2.1347    2.6359    3.1795    4.3502  
        intercept_versicolor   -0.1652    0.4888    0.8340    1.1858    1.8891  
         intercept_virginica   -2.0101   -1.1944   -0.7453   -0.2834    0.5836  

Since we ran multiple chains, we may as well do a spot check to make sure each chain converges around similar points.

plot(chain)

svg

Looks good!

We can also use the corner function from MCMCChains to show the distributions of the various parameters of our multinomial logistic regression. The corner function requires MCMCChains and StatsPlots.

corner(
    chain, [Symbol("coefficients_versicolor[$$i]") for i in 1:4];
    label=[string(i) for i in 1:4], fmt=:png
)

png

corner(
    chain, [Symbol("coefficients_virginica[$$i]") for i in 1:4];
    label=[string(i) for i in 1:4], fmt=:png
)

png

Fortunately the corner plots appear to demonstrate unimodal distributions for each of our parameters, so it should be straightforward to take the means of each parameter’s sampled values to estimate our model to make predictions.

Making Predictions

How do we test how well the model actually predicts whether someone is likely to default? We need to build a prediction function that takes the test dataset and runs it through the average parameter calculated during sampling.

The prediction function below takes a Matrix and a Chains object. It computes the mean of the sampled parameters and calculates the species with the highest probability for each observation. Note that we do not have to evaluate the softmax function since it does not affect the order of its inputs.

function prediction(x::Matrix, chain)
    # Pull the means from each parameter's sampled values in the chain.
    intercept_versicolor = mean(chain, :intercept_versicolor)
    intercept_virginica = mean(chain, :intercept_virginica)
    coefficients_versicolor = [mean(chain, "coefficients_versicolor[$$i]") for i in 1:4]
    coefficients_virginica = [mean(chain, "coefficients_virginica[$$i]") for i in 1:4]

    # Compute the index of the species with the highest probability for each observation.
    values_versicolor = intercept_versicolor .+ x * coefficients_versicolor
    values_virginica = intercept_virginica .+ x * coefficients_virginica
    species_indices = [argmax((0, x, y)) for (x, y) in zip(values_versicolor, values_virginica)]
    
    return species_indices
end;

Let’s see how we did! We run the test matrix through the prediction function, and compute the accuracy for our prediction.

# Make the predictions.
predictions = prediction(test_features, chain)

# Calculate accuracy for our test set.
mean(predictions .== testset[!, :Species_index])
0.8533333333333334

Perhaps more important is to see the accuracy per class.

for s in 1:3
    rows = testset[!, :Species_index] .== s
    println("Number of `", species[s], "`: ", count(rows))
    println("Percentage of `", species[s], "` predicted correctly: ",
        mean(predictions[rows] .== testset[rows, :Species_index]))
end
Number of `setosa`: 22
Percentage of `setosa` predicted correctly: 1.0
Number of `versicolor`: 24
Percentage of `versicolor` predicted correctly: 0.875
Number of `virginica`: 29
Percentage of `virginica` predicted correctly: 0.7241379310344828

This tutorial has demonstrated how to use Turing to perform Bayesian multinomial logistic regression.