AI for Medical Diagnosis - Lab 2
Counting Labels
Class imbalance can skew the loss function because each class contributes a different amount to the total loss. To choose weights that compensate for this, we must first calculate the class frequencies.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Read the CSV file into a DataFrame
train_df = pd.read_csv('/content/train-small.csv')
# Display first 5 elements of the df
train_df.head()
 | Image | Atelectasis | Cardiomegaly | Consolidation | Edema | Effusion | Emphysema | Fibrosis | Hernia | Infiltration | Mass | Nodule | PatientId | Pleural_Thickening | Pneumonia | Pneumothorax
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 00008270_015.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8270 | 0 | 0 | 0 |
1 | 00029855_001.png | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 29855 | 0 | 0 | 0 |
2 | 00001297_000.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1297 | 1 | 0 | 0 |
3 | 00012359_002.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12359 | 0 | 0 | 0 |
4 | 00017951_001.png | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 17951 | 0 | 0 | 0 |
# Drop the non-class columns, then count the positive labels in each class
class_counts = train_df.drop(columns=['Image', 'PatientId']).sum()
class_counts
Atelectasis 106
Cardiomegaly 20
Consolidation 33
Edema 16
Effusion 128
Emphysema 13
Fibrosis 14
Hernia 2
Infiltration 175
Mass 45
Nodule 54
Pleural_Thickening 21
Pneumonia 10
Pneumothorax 38
dtype: int64
# Print class counts more descriptively
for c in class_counts.keys():
    print(f"The class {c} has {train_df[c].sum()} samples.\n")
print(f"For a total of: {class_counts.values.sum()} samples.")
The class Atelectasis has 106 samples.
The class Cardiomegaly has 20 samples.
The class Consolidation has 33 samples.
The class Edema has 16 samples.
The class Effusion has 128 samples.
The class Emphysema has 13 samples.
The class Fibrosis has 14 samples.
The class Hernia has 2 samples.
The class Infiltration has 175 samples.
The class Mass has 45 samples.
The class Nodule has 54 samples.
The class Pleural_Thickening has 21 samples.
The class Pneumonia has 10 samples.
The class Pneumothorax has 38 samples.
For a total of: 675 samples.
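The counts above can be turned into the class frequencies mentioned at the top by dividing by the number of examples. A minimal sketch, using a small hypothetical frame with the same layout as `train_df`:

```python
import pandas as pd

# Hypothetical mini-frame with the same layout as train_df
df = pd.DataFrame({
    "Image": ["a.png", "b.png", "c.png", "d.png"],
    "PatientId": [1, 2, 3, 4],
    "Atelectasis": [1, 0, 0, 1],
    "Effusion": [0, 0, 1, 0],
})

# Class frequency = number of positive labels / number of examples
counts = df.drop(columns=["Image", "PatientId"]).sum()
freqs = counts / len(df)
print(freqs)
```

These per-class frequencies are what the weighted loss below is built from.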
Data Visualization
Plot the Distribution of Counts
sns.set()
sns.barplot(x=class_counts.values, y=class_counts.index, palette="Blues_d")
plt.title("Distribution of Classes for Training Dataset", fontsize=15)
plt.xlabel("Number of Patients", fontsize=15)
plt.ylabel("Diseases", fontsize=15)
plt.show();
Weighted Loss Function
Define a hypothetical set of true labels and then a set of random predictions. These samples can then be used to calculate the weighted loss function.
# Generate an np array of 10 labels
# 7 positive and 3 negative -- then reshape array to a column
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]).reshape(10, 1)
print(y_true, y_true.shape)
[[1]
[1]
[1]
[1]
[1]
[1]
[1]
[0]
[0]
[0]] (10, 1)
Define positive and negative weights to be used in the loss function. The positive weight is the number of negative cases divided by the total number of cases: \(3 / 10\).
The negative weight is the number of positive cases divided by the total number of cases: \(7 / 10\).
We also need to define epsilon, a small positive value used to prevent errors when calculating the \(\log\) of zero.
# Define positive and negative weights
positive_weight = 3/10
negative_weight = 7/10
# Define epsilon
epsilon = 1e-7
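Rather than hard-coding \(3/10\) and \(7/10\), the weights can be derived directly from `y_true`; a small sketch:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]).reshape(10, 1)

# Positive weight = fraction of negative cases,
# negative weight = fraction of positive cases
positive_weight = np.sum(y_true == 0) / y_true.size
negative_weight = np.sum(y_true == 1) / y_true.size
print(positive_weight, negative_weight)  # 0.3 0.7
```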
Weighted Loss Equation
Calculate the loss for the zero-th label (column at index 0)
- The loss is made up of two terms:
  - \(loss_{pos}\): the loss over the examples whose actual label is positive (the positive examples).
  - \(loss_{neg}\): the loss over the examples whose actual label is negative (the negative examples).
- Note that within the \(\log()\) function, we add a tiny positive value to avoid an error if taking the log of zero.
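The cells below use `y_predict`, whose definition is not shown. A minimal sketch consistent with the "random predictions" described earlier (note the exact loss values printed below depend on the particular random draw, so a fresh draw will not reproduce those numbers; the seed here is a hypothetical choice for reproducibility):

```python
import numpy as np

np.random.seed(0)  # hypothetical seed, only for reproducibility
y_predict = np.random.random((10, 1))  # random probabilities in [0, 1)
print(y_predict.shape)
```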
# Calculate positive loss
positive_loss = -1 * np.sum(positive_weight *
y_true *
np.log(y_predict + epsilon)
)
positive_loss
19.436539145242033
# Calculate negative loss
negative_loss = -1 * np.sum(negative_weight *
(1 - y_true) *
np.log(1 - y_predict + epsilon)
)
negative_loss
1.611808725096189
# Calculate total loss
total_loss = positive_loss + negative_loss
total_loss
21.048347870338223
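The two terms above can be packaged into a single reusable function; a sketch, assuming binary labels and predicted probabilities of the same shape:

```python
import numpy as np

def weighted_loss(y_true, y_predict, positive_weight, negative_weight, epsilon=1e-7):
    """Weighted binary cross-entropy, summed over all examples."""
    positive_loss = -np.sum(positive_weight * y_true * np.log(y_predict + epsilon))
    negative_loss = -np.sum(negative_weight * (1 - y_true) * np.log(1 - y_predict + epsilon))
    return positive_loss + negative_loss

# Sanity check: perfect predictions give a (near-)zero loss
y = np.array([1, 0, 1]).reshape(3, 1)
print(weighted_loss(y, y.astype(float), 3 / 10, 7 / 10))
```

Because the positive term is scaled by the fraction of negatives (and vice versa), the rare class contributes as much to the total loss as the common one.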