Boosting

Boosting is one of the most widely used families of algorithms for solving many kinds of machine learning problems. It combines several weak learners, i.e. simple models, each of which compensates for the patterns missed or the data points misclassified by the preceding model by giving those data points higher priority. One of the most common and simplest boosting algorithms is AdaBoost, which we will discuss in detail in this blog. We will also build the algorithm from scratch and watch it improve accuracy.

Before proceeding to AdaBoost, let’s get an overview of the boosting principle through the following image (Image source)

In the image above, boxes 1, 2 and 3 show the classifications made by the models D1, D2 and D3, which are individually not very effective classifiers. But when they are combined, as shown in box 4, they perform the classification very accurately.

Something similar happens in AdaBoost, where a weight is assigned to each data point and subsequently updated to a higher or lower value depending on whether that point was classified correctly. We will walk through each step of the algorithm alongside a practical implementation.

For our understanding, let’s use the ‘breast cancer’ dataset provided by the sklearn library. Let’s first load and visualize the data.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset into a dataframe
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['label'] = data.target

# Keep an untouched copy of the original data for later use
copy = df.copy()

df.head()

Since the dataset has a large number of columns, we will mainly focus on the ‘label’ column and the new columns that will get appended after it.

Let’s use a decision tree to make an initial prediction and check the accuracy we obtain.

dt = DecisionTreeClassifier(random_state=50, min_samples_split=100)
dt.fit(df.iloc[:,:-1],df.iloc[:,-1])
df['prediction'] = dt.predict(df.iloc[:,:-1])
df.iloc[:5,-2:]
print('Accuracy =',accuracy_score(df.label, df.prediction))
Accuracy = 0.945518453427065
We are getting an accuracy of around 95%.

Let’s see whether we get better accuracy after applying AdaBoost.

Step 1

Assign a weight of 1/m to each data point, where m is the number of data points.

df['weight'] = 1/len(df)
df.iloc[:5,-3:]

Step 2

Finding the number of wrongly classified data points.

no_of_errors = len(df[df.label != df.prediction])
no_of_errors

There are 31 wrongly classified data points.

Calculating total error, where total error = no_of_errors/total number of data points

total_errors = no_of_errors/len(df)
total_errors

We get a total error of 0.054481546572934976.

Step 3

Calculating the amount of say, α = ½ × ln((1 − total error) / total error)

alpha = 0.5 * np.log((1-total_errors)/total_errors)
alpha

We get 1.426935677838319 as the amount of say.
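
To build some intuition (this small check is purely illustrative and not part of the algorithm), you can print α for a few different error values: an error of 0.5, i.e. a model no better than random guessing, gets zero say, and the say grows as the error shrinks.

# Amount of say for a few illustrative total error values
for err in [0.01, 0.1, 0.3, 0.5]:
    print(err, 0.5 * np.log((1 - err) / err))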

Step 4

Weights are updated using the following rule:

For incorrect predictions,

updated weight = old weight × e^α

For correct predictions,

updated weight = old weight × e^(−α)

# Increase the weights of the misclassified points...
df['weight_updated'] = df.loc[df.label != df.prediction].weight * np.exp(alpha)
# ...and decrease the weights of the correctly classified points
df.weight_updated = df['weight_updated'].fillna(df[df.label == df.prediction].weight * np.exp(-alpha))
df.iloc[:5,-4:]

The updated weights are now normalized so that the sum of all values in the column equals 1.

df.weight_updated = df.weight_updated/df.weight_updated.sum()
df.iloc[:5,-4:]

Now we can see that, in the updated weights column, the values for the incorrect predictions are significantly higher than those for the correct predictions.
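
A quick way to confirm this (purely illustrative) is to compare the average updated weight of the misclassified and correctly classified points:

# Average updated weight: False = misclassified, True = correctly classified
df.groupby(df.label == df.prediction).weight_updated.mean()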

Step 5

Now a range is created for each updated weight, which is nothing but the cumulative sum of the values in the column. For example, the range for index 0 is ‘0 to 0.016129’, for index 1 it is ‘0.016129 to (0.016129+0.000929)’, for index 2 it is ‘(0.016129+0.000929) to ((0.016129+0.000929) + 0.000929)’, and so on.

This way the upper bound of the range for the last index reaches 1, since after normalization the weights sum to 1.
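
The next step relies on a ‘ranges’ column, which can be built with a simple cumulative sum of the normalized weights (the adaboost function later in this post builds the same column with an explicit loop):

# Upper bound of each data point's range = running total of the normalized weights
df['ranges'] = df.weight_updated.cumsum()
df.iloc[:5,-5:]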

Step 6

Resampling of the data is done. The process is as follows: a random number between 0 and 1 is chosen, and the index from ‘df’ whose range contains that number is included in the new resampled dataframe. Since the weights of the incorrect predictions are high, their ranges are wide as well. Therefore many of the randomly chosen numbers will fall in the ranges of the incorrectly predicted data points, so those data points will be repeated a large number of times in the resampled dataframe and will get more priority.

# Build a new dataframe by sampling rows according to their ranges
resampled = pd.DataFrame(columns=df.columns[:31])
for i in range(len(df)):
    # Pick the row whose range contains the random number
    index = df[df.ranges == df[np.random.rand()<df.ranges].ranges.min()].index
    resampled.loc[i] = list(df.iloc[index,:31].values[0])
    
resampled.head()

This resampled data is then processed again in the same way, starting from Step 1.

These 6 steps are repeated a number of times until the total error becomes 0 and the amount of say goes to infinity. For this purpose, let’s build a function that gathers all of the above code so we can run it over several iterations.

def adaboost(df):
    # Fit a weak learner on the current data and record its predictions
    dt = DecisionTreeClassifier(random_state=50, min_samples_split=100)
    dt.fit(df.iloc[:,:30],df.iloc[:,30])
    df['prediction'] = dt.predict(df.iloc[:,:30])
    
    # Step 1: assign equal weights to all data points
    df['weight'] = 1/len(df)
    
    # Step 2: count the misclassified points and compute the total error
    no_of_errors = len(df[df.label != df.prediction])
    total_errors = no_of_errors/len(df)
    
    # Step 3: compute the amount of say
    alpha = 0.5 * np.log((1-total_errors)/total_errors)
    
    # Step 4: update the weights and normalize them
    df['weight_updated'] = df.loc[df.label != df.prediction].weight * np.exp(alpha)
    df.weight_updated = df['weight_updated'].fillna(df[df.label == df.prediction].weight * np.exp(-alpha))
    df.weight_updated = df.weight_updated/df.weight_updated.sum()
    
    # Step 5: build the cumulative ranges
    p = 0
    for i in range(len(df)):
        df.loc[i,'ranges'] = df.loc[i,'weight_updated'] + p
        p = df.loc[i,'ranges']
        
    # Step 6: resample the data according to the ranges
    resampled = pd.DataFrame(columns=df.columns[:31])
    for i in range(len(df)):
        index = df[df.ranges == df[np.random.rand()<df.ranges].ranges.min()].index
        resampled.loc[i] = list(df.iloc[index,:31].values[0])  
    
    df = resampled
    
    return [df, dt]

The above function returns the resampled dataframe and the trained model for each iteration. It is executed repeatedly, and the final resampled dataframe and the trained model from each iteration are stored.

df = copy.copy()

models = []    
    
try:
    # Keep boosting until the total error hits 0, at which point an exception ends the loop
    for iter in range(20):        
        ada = adaboost(df)
        df = ada[0]    
        models.append(ada[1])
        print('Decision stump {0}'.format(iter+1))
    
except Exception:
    pass

We obtained 10 decision stumps, or models, which will together make future predictions. These models will now be used on the same dataset for which we initially got an accuracy of 95%. Let’s apply them to that dataset and aggregate their outputs.

pred = np.zeros(len(df))
for i in range(len(models)):    
    pred += models[i].predict(copy.iloc[:,:-1])

pred

These are the aggregated predicted outputs of the models. For each model, the output is either 1 or 0. Therefore, in the above output array, the number 2 means 2 models classified that point as 1, and the number 6 means 6 out of 10 models classified it as 1.

The threshold is set to half the total number of models, i.e., if an aggregated output is greater than half the number of models used, the point is classified as 1, otherwise 0. In our case, output values greater than 5 are considered 1 and the rest are 0.

threshold = len(models)/2
vec = np.vectorize(lambda x: 1 if x>threshold else 0)
final_prediction = vec(pred)
final_prediction

Now based on the above output, the accuracy is calculated.

copy['final_prediction'] = final_prediction

print('Accuracy =',accuracy_score(copy.label, copy.final_prediction))
Accuracy = 0.9753954305799648
This time we obtained an accuracy of about 97.5%.

Thus, we can see that we have successfully improved the accuracy using boosting.

Note that this entire process is for demonstration and understanding purposes only; I don’t encourage using it for solving real-world problems. For practical usage, you can use sklearn’s built-in AdaBoostClassifier instead.
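
For reference, a minimal sketch of what that could look like on the same data (the parameter values here are illustrative, not tuned):

from sklearn.ensemble import AdaBoostClassifier

# Library implementation of AdaBoost; by default it boosts decision stumps
clf = AdaBoostClassifier(n_estimators=10, random_state=50)
clf.fit(copy.iloc[:,:30], copy.label)
print('Accuracy =', clf.score(copy.iloc[:,:30], copy.label))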

I hope I have explained the algorithm well enough for you to work with it. Your feedback is very much appreciated.

—END—
