AI for Medical Diagnosis - Lab 2

Counting Labels

Class imbalance can affect the loss function because it weights the losses differently for each class. To choose the weights, we must first calculate the class frequencies.
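
To make the idea concrete, here is a small standalone sketch with made-up counts (an illustration, not the lab's data): if each class's positive weight is set to the fraction of negative examples and its negative weight to the fraction of positive examples, both sides contribute equally to the loss.

import pandas as pd

# Hypothetical label counts for an illustrative 1000-example dataset
counts = pd.Series({"Edema": 16, "Effusion": 128})
n_examples = 1000

freq_pos = counts / n_examples   # fraction of positive examples per class
w_pos = 1 - freq_pos             # weight applied to positive examples
w_neg = freq_pos                 # weight applied to negative examples

# With this choice the weighted contributions balance out:
# w_pos * freq_pos == w_neg * (1 - freq_pos) for every class
assert ((w_pos * freq_pos) == (w_neg * (1 - freq_pos))).all()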

# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns


# Read the CSV file into a DataFrame

train_df = pd.read_csv('/content/train-small.csv')

# Display the first 5 rows of the DataFrame

train_df.head()
Image Atelectasis Cardiomegaly Consolidation Edema Effusion Emphysema Fibrosis Hernia Infiltration Mass Nodule PatientId Pleural_Thickening Pneumonia Pneumothorax
0 00008270_015.png 0 0 0 0 0 0 0 0 0 0 0 8270 0 0 0
1 00029855_001.png 1 0 0 0 1 0 0 0 1 0 0 29855 0 0 0
2 00001297_000.png 0 0 0 0 0 0 0 0 0 0 0 1297 1 0 0
3 00012359_002.png 0 0 0 0 0 0 0 0 0 0 0 12359 0 0 0
4 00017951_001.png 0 0 0 0 0 0 0 0 1 0 0 17951 0 0 0
# Count the positive labels per class by summing each column, then drop the non-class columns

class_counts = train_df.sum().drop(['Image', 'PatientId'])
class_counts
Atelectasis           106
Cardiomegaly           20
Consolidation          33
Edema                  16
Effusion              128
Emphysema              13
Fibrosis               14
Hernia                  2
Infiltration          175
Mass                   45
Nodule                 54
Pleural_Thickening     21
Pneumonia              10
Pneumothorax           38
dtype: object
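
Note the dtype: object above — the string-valued columns were summed along with the rest before being dropped. As a small addition (not in the original notebook), the counts can be cast to integers and divided by the number of rows to obtain the class frequencies mentioned at the top:

# Cast counts to integers and derive per-class frequencies

class_frequencies = class_counts.astype(int) / len(train_df)
class_frequencies
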
# Print class counts more descriptively

for c in class_counts.keys():
  print(f"The class {c} has {train_df[c].sum()} samples.\n")
  
print(f"For a total of: {class_counts.values.sum()} samples.")
The class Atelectasis has 106 samples.

The class Cardiomegaly has 20 samples.

The class Consolidation has 33 samples.

The class Edema has 16 samples.

The class Effusion has 128 samples.

The class Emphysema has 13 samples.

The class Fibrosis has 14 samples.

The class Hernia has 2 samples.

The class Infiltration has 175 samples.

The class Mass has 45 samples.

The class Nodule has 54 samples.

The class Pleural_Thickening has 21 samples.

The class Pneumonia has 10 samples.

The class Pneumothorax has 38 samples.

For a total of: 675 samples.

Data Visualization

Plot the Distribution of Counts

sns.set()
sns.barplot(x=class_counts.values, y=class_counts.index, palette="Blues_d")
plt.title("Distribution of Classes for Training Dataset", fontsize=15)
plt.xlabel("Number of Patients", fontsize=15)
plt.ylabel("Diseases", fontsize=15)
plt.show();

Figure: distribution of class counts for the training dataset.

Weighted Loss Function

Define a hypothetical set of true labels and then a set of random predictions. These samples can then be used to calculate the weighted loss function.

# Generate an np array of 10 labels
# 7 positive and 3 negative -- then reshape array to a column

y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]).reshape(10, 1)
print(y_true, y_true.shape)
[[1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]] (10, 1)
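
The prediction vector y_predict is not shown in this write-up, so here is a minimal sketch that draws random scores in \([0, 1)\). Note that the loss values printed further down come from the notebook's original (unshown) predictions, so a different draw will give different numbers.

# Generate 10 random prediction scores and reshape to a column
# NOTE: assumed definition -- the original y_predict cell is not shown

np.random.seed(0)
y_predict = np.random.rand(10, 1)
print(y_predict, y_predict.shape)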

Define the positive and negative weights to be used in the loss function. The positive weight is the number of negative cases divided by the total number of cases: \(3 / 10\).

The negative weight is the number of positive cases divided by the total number of cases: \(7 / 10\).

We also need to define epsilon, a small positive value used to prevent errors when calculating the \(log\) of zero.

# Define positive and negative weights

positive_weight = 3/10

negative_weight = 7/10

# Define epsilon

epsilon = 1e-7
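
To see why epsilon matters: np.log(0) evaluates to negative infinity (with a runtime warning), while adding epsilon keeps the penalty large but finite.

# log(0) is -inf; log(0 + epsilon) is a large negative, finite number

np.log(0 + epsilon)   # approximately -16.118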

Weighted Loss Equation

Calculate the loss for the zero-th label (column at index 0)

  • The loss is made up of two terms:
    • \(loss_{pos}\): we’ll use this to refer to the loss where the actual label is positive (the positive examples).
    • \(loss_{neg}\): we’ll use this to refer to the loss where the actual label is negative (the negative examples).
  • Note that within the \(log()\) function we’ll add a tiny positive value to avoid an error when taking the log of zero.

\[loss^{(i)} = loss_{pos}^{(i)} + loss_{neg}^{(i)}\]

\[loss_{pos}^{(i)} = -1 \times weight_{pos}^{(i)} \times y^{(i)} \times log(\hat{y}^{(i)} + \epsilon)\]

\[loss_{neg}^{(i)} = -1 \times weight_{neg}^{(i)} \times (1 - y^{(i)}) \times log(1 - \hat{y}^{(i)} + \epsilon)\]

\[\epsilon = \text{a tiny positive number}\]
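
As a quick worked example (not part of the original lab): a positive example predicted at \(\hat{y} = 0.9\) contributes \(-0.3 \times log(0.9) \approx 0.032\) to the loss, while the same example predicted at \(\hat{y} = 0.1\) contributes \(-0.3 \times log(0.1) \approx 0.691\), so confident mistakes are penalized far more heavily.
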
# Calculate positive loss

positive_loss = -1 * np.sum(positive_weight *
                            y_true *
                            np.log(y_predict + epsilon))

positive_loss

19.436539145242033

# Calculate negative loss

negative_loss = -1 * np.sum(negative_weight *
                            (1 - y_true) *
                            np.log(1 - y_predict + epsilon))

negative_loss

1.611808725096189

# Calculate total loss 

total_loss = positive_loss + negative_loss
total_loss

21.048347870338223
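
As a sanity check (an addition, relying on the y_predict sketch above), the same total can be computed in a single expression:

# Single-expression weighted loss; should match positive_loss + negative_loss

total_loss_check = -np.sum(positive_weight * y_true * np.log(y_predict + epsilon) +
                           negative_weight * (1 - y_true) * np.log(1 - y_predict + epsilon))
total_loss_check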
