Classification Random Forest

Project information

Category: Machine Learning, Classification
Source Data: Download

Project Details

Description

Parkinson's disease is a brain disorder that causes unintended or uncontrollable movements, such as shaking, stiffness, and difficulty with balance and coordination. Symptoms usually begin gradually and worsen over time. As the disease progresses, people may have difficulty walking and talking. The dataset has a binary status label (1 means the subject has Parkinson's, 0 means healthy), and each record has 24 columns: the subject name, the status label, and 22 voice measurements.

Data Description:

name - ASCII subject name and recording number

MDVP:Fo(Hz) - Average vocal fundamental frequency

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

MDVP:Flo(Hz) - Minimum vocal fundamental frequency

MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency

MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude

NHR,HNR - Two measures of ratio of noise to tonal components in the voice

status - Health status of the subject (one) - Parkinson's, (zero) - healthy

RPDE,D2 - Two nonlinear dynamical complexity measures

DFA - Signal fractal scaling exponent

spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

The goal of this project is to train a random forest model, analyze the importance of the variables, tune the accuracy of the model, and predict the status of a subject from the voice features.

Because the dataset includes a labeled target variable, this is a supervised machine learning problem. Supervised machine learning algorithms are trained on well-labeled training data. Labeled data means that the training data is already tagged with the correct output.

In this project, I will solve the problem using Random Forest, a supervised machine learning algorithm that can be used for both classification and regression. Random Forest is also one of the more robust classification methods.

How do Random Forest algorithms work?

Random forest algorithms have three main hyperparameters, which need to be set before training. These include node size, the number of trees, and the number of features sampled. From there, the random forest classifier can be used to solve for regression or classification problems.
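
As a rough illustration, I believe these three hyperparameters map onto the min_samples_leaf, n_estimators, and max_features arguments of scikit-learn's RandomForestClassifier. The sketch below uses synthetic data from make_classification rather than the Parkinson's dataset, and the values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 200 samples, 22 features
X_demo, y_demo = make_classification(n_samples=200, n_features=22, random_state=42)

rf_demo = RandomForestClassifier(
    n_estimators=50,       # number of trees
    min_samples_leaf=2,    # node size: minimum number of samples in a leaf
    max_features="sqrt",   # number of features sampled at each split
    random_state=42,
)
rf_demo.fit(X_demo, y_demo)

# One fitted decision tree per estimator
print(len(rf_demo.estimators_))
```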

The random forest algorithm is made up of a collection of decision trees, and each tree in the ensemble is built from a data sample drawn from the training set with replacement, called the bootstrap sample. For each tree, roughly one-third of the training rows are never drawn and so are left out of its bootstrap sample; these form the out-of-bag (oob) sample, which serves as test data for that tree. Another instance of randomness is then injected through feature bagging, which adds diversity to the trees and reduces the correlation among them. How the prediction is determined depends on the type of problem. For a regression task, the predictions of the individual decision trees are averaged, and for a classification task, a majority vote (i.e. the most frequent predicted class) yields the predicted class. Finally, the oob sample can be used to validate the model, similar to cross-validation.
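
To make the bootstrap and out-of-bag idea concrete, here is a minimal sketch that draws one bootstrap sample with NumPy and measures how many rows are left out. The sample size and seed are arbitrary assumptions, and this is only an illustration of the sampling step, not how scikit-learn implements it internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Draw a bootstrap sample: n_samples row indices, with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were never drawn are the out-of-bag rows for this tree
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()

# The fraction is close to 1/e (about 0.368), i.e. roughly one-third
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

In scikit-learn, passing oob_score=True to RandomForestClassifier makes the out-of-bag accuracy estimate available as oob_score_ after fitting.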

Python Packages:

1. Matplotlib

2. Seaborn

3. Pandas

4. Scikit-learn

5. NumPy

Roadmap:

1. Load the data

2. Analyze and visualize the dataset

3. Split a dataset into training and testing datasets

4. Train the model

5. Model Evaluation

6. Testing the model

7. Improving performance

Step 1: Load the data

Import packages:

# import libraries
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report

First, I imported the necessary packages for this project.

Pandas will be used to load the data; it can read from many sources, such as local or cloud storage, databases, Excel files, and CSV files.

Matplotlib and Seaborn will be used for data visualization.

Scikit-learn will be used for splitting the data, reporting accuracy, and modeling. NumPy will be used to build the array of new samples in Step 6.

url=r'url'
df=pd.read_csv(url)
print(df.head())

Because the data is hosted on a web server, I define a url variable and pass it to read_csv, which reads the data and takes the column names from the file's header row, matching the Parkinson's data description.

df.head() prints the first five rows of the dataset.

Step 2: Analyze and visualize the dataset

# display base statistical analysis regarding the data
print(df.describe())

From this summary, I can see the basic statistics of each column, such as the mean, the min and max values, and the 25%, 50%, and 75% percentile values.

Let’s visualize the dataset.

The Seaborn pairplot method helps visualize the whole dataset.

# visualize the dataset by labeled variable
sns.pairplot(df, hue='status')
plt.savefig('parkinsons variable analysis')

Looking at the visualization, subjects whose status is 1 (yes) appear to have higher values on many of the measures than subjects whose status is 0 (no).

# Separate features and target
# Drop the subject identifier ('name') and the label ('status') so that
# feature_list lines up column-for-column with X
feature=df.drop(['name','status'], axis=1)
feature_list=list(feature.columns)
X=feature.values
Y=df.loc[:,'status'].values

X is the feature dataset, containing all the variables used to train the model.

Y is the label dataset, containing the target values.

Step 3: Split the dataset into training and testing datasets

# Split the dataset
X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.2, random_state=42)

I split the whole dataset into two groups, a training dataset and a testing dataset, using train_test_split with an 80%/20% ratio. I will use the testing dataset to check the accuracy of the model later.
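
As a small illustration of what an 80%/20% split produces, here is a sketch on toy data (the array sizes and labels are made up). It also passes the stratify argument, which the split above does not use; I include it only as an option that keeps the class proportions similar in both groups.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 rows, 3 features, 75% of labels are 1
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.array([1] * 75 + [0] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)    # (80, 3) (20, 3)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each group
```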

Step 4: Train the model

model=RandomForestClassifier(n_estimators=10,random_state=42)
model.fit(X_train,y_train)

I initialize the RandomForestClassifier class from the scikit-learn package and assign it to the model variable.

I set the number of estimators to 10, which can be tuned to improve accuracy. Later I will talk about how to use the elbow method.

I feed the training features and their labels into the algorithm using the fit() method.

Step 5: Model Evaluation

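pred=model.predict(X_test)
print('First Run Performance:'+str(accuracy_score(y_test,pred)*100))
# Detailed per-class precision, recall, f1-score and support
print(classification_report(y_test,pred))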

I predict the classes of the testing dataset using my trained model.

Then I check the accuracy score of the predicted results.

accuracy_score takes the true values and the predicted values and returns the fraction of correct predictions, which I multiply by 100 to get a percentage.

The accuracy is above 92%.

I also show the detailed classification report based on the testing dataset.

The report gives detailed information about the predictions.

Precision is the ratio of true positives to the sum of true positives and false positives.

Recall is the ratio of true positives to the sum of true positives and false negatives.

F1-score is the harmonic mean of precision and recall.

Support is the number of actual occurrences of the class in the testing dataset.

Accuracy is the fraction of all predictions that are correct.
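
To show how these numbers relate, here is a small worked sketch that computes them by hand for the positive class. The labels are made up for illustration and are not results from the Parkinson's model; the scikit-learn functions are imported only to cross-check the hand calculation.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels (assumption), not output from the trained model
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# Count the outcomes for the positive class (1)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(precision, recall, f1, accuracy)
```

Here there are 4 true positives, 1 false positive, and 1 false negative, so precision and recall are both 4/5, and 6 of the 8 predictions are correct.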

Step 6: Testing the Model

#Testing the model
X_new = np.array([[114.38000,130.10900,104.63400,0.00332,0.00003,0.00160,0.00199,0.00480,0.01503,0.13700,0.00812,0.00933,0.01133,0.02436,0.00401,26.00500,0.405991,0.761255,-5.966779,0.197938,1.974857,0.184067],
                  [117.38800,125.03800,110.97000,0.00346,0.00003,0.00169,0.00213,0.00507,0.01725,0.15500,0.00874,0.01021,0.01331,0.02623,0.00415,26.14300,0.361232,0.763242,-6.016891,0.109256,2.004719,0.174429],
                  [151.73700,190.20400,129.85900,0.00314,0.00002,0.00135,0.00152,0.00506,0.01469,0.13200,0.00728,0.00886,0.01230,0.02184,0.00570,24.15100,0.396610,0.745957,-6.486822,0.197919,2.449763,0.132703]])
pred_new = model.predict(X_new)
# expected results:0,0,1
print("Prediction of status: {}".format(pred_new))

I randomly generated these values based on the average plot to see whether the model works correctly.

It looks like the model predicts correctly, because the outputs match my expectation that a sample whose key features have much higher values than the others is classified as status 1.

Step 7: Improving Performance

# Bagging estimate: try every number of estimators from 1 to 199
estimator_range = list(range(1,200))
models = []
scores = []

for n_estimators in estimator_range:

    # Create bagging classifier
    clf = BaggingClassifier(n_estimators = n_estimators, random_state = 42)

    # Fit the model
    clf.fit(X_train,y_train)

    # Append the model and score to their respective list
    models.append(clf)
    scores.append(accuracy_score(y_true = y_test, y_pred = clf.predict(X_test)))

# Generate the plot of scores against number of estimators
plt.figure(figsize=(9,6))
plt.plot(estimator_range, scores)

# Adjust labels and font (to make visible)
plt.xlabel("n_estimators", fontsize = 18)
plt.ylabel("score", fontsize = 18)
plt.tick_params(labelsize = 16)

# Visualize plot
# plt.show()
plt.savefig('bagging estimation')

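I used hyperparameter tuning. This is a complicated phrase that means “adjust the settings to improve the performance” (the settings are known as hyperparameters to differentiate them from the model parameters learned during training). In the example above, I used a simple grid search over a single hyperparameter, n_estimators, to look for better performance. Note that the sweep uses BaggingClassifier, which bags decision trees by default, rather than RandomForestClassifier.

Define the range from 1 to 199.
Use the bagging classifier to try each value one by one and log the score for later analysis.
Plot the scores to assess the result and save the figure as output. The elbow of the curve, where adding more estimators stops improving the score, suggests a reasonable value for n_estimators.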
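
The loop above scores every setting on the testing dataset. A different technique, which I did not use in this project, is to let scikit-learn's GridSearchCV pick the setting with cross-validation on the training data only and keep the testing dataset for a final check. Below is a minimal sketch of that idea; the synthetic data and the three candidate values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption), not the Parkinson's dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=22, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate values to try (illustrative)
param_grid = {"n_estimators": [10, 30, 50]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)       # setting with the best cross-validated accuracy
print(search.score(X_te, y_te))  # accuracy of the refitted best model on held-out data
```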

To quantify the usefulness of the variables in the random forest, I can look at their relative importance. The importance values returned by the package indicate how much including a specific variable improves the prediction. I can use the numbers to make relative comparisons among variables.

The top six are MDVP:Fo(Hz) (average vocal fundamental frequency), PPE (a nonlinear measure of fundamental frequency variation), spread1 (a nonlinear measure of fundamental frequency variation), MDVP:Fhi(Hz) (maximum vocal fundamental frequency), MDVP:Flo(Hz) (minimum vocal fundamental frequency), and MDVP:Jitter(Abs) (a measure of variation in fundamental frequency). This tells me that these variables play a more important role than the others in predicting Parkinson's status.
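
As a sketch of how such a ranking can be produced, the example below reads the feature_importances_ attribute of a fitted RandomForestClassifier, sorts it, and picks the column indices of the top features. The data is synthetic and the feature names are hypothetical placeholders, so the ranking itself says nothing about the Parkinson's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (assumptions)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
names = [f"feature_{i}" for i in range(5)]

rf_demo = RandomForestClassifier(n_estimators=30, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Pair each name with its importance and sort from most to least important
ranked = sorted(
    zip(names, rf_demo.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature_name, importance in ranked:
    print(f"{feature_name}: {importance:.3f}")

# Column indices of the two most important features
top_indices = [names.index(feature_name) for feature_name, _ in ranked[:2]]
X_top = X_demo[:, top_indices]
```

The importances sum to 1, so each value can be read as a share of the total.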

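# Visualizations
# Pair each feature name with its importance from the trained forest
# (the columns of X exclude 'name' and 'status')
feature_names=[c for c in df.columns if c not in ('name','status')]
feature_importances=sorted(zip(feature_names, model.feature_importances_), key=lambda pair: pair[1], reverse=True)

# Make a bar chart
name=[]
value=[]

for pair in feature_importances:
    name.append(pair[0])
    value.append(pair[1])

plt.style.use('fivethirtyeight')
# Start a new figure so the bars are not drawn over the previous plot
plt.figure(figsize=(9,6))
plt.bar(name, value, color='maroon', orientation = 'vertical')
# Tick labels for x axis
plt.xticks(rotation='vertical')
# Axis labels and title
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.show()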

I also visualized the importance of these variables to get a clearer understanding.

# Performance Improvement
# New random forest with only the most important variables
rf_most_important = RandomForestClassifier(n_estimators=30,random_state=42)
# Look up the column positions of the most important features
# (index against the columns of X, which exclude 'name' and 'status')
important_features=['MDVP:Fo(Hz)','PPE','spread1','MDVP:Fhi(Hz)','MDVP:Flo(Hz)',
                    'MDVP:Jitter(Abs)','Jitter:DDP','spread2','D2']
x_columns=[c for c in df.columns if c not in ('name','status')]
important_indices=[x_columns.index(f) for f in important_features]
X_train_important=X_train[:,important_indices]
X_test_important=X_test[:,important_indices]
# Train the random forest
rf_most_important.fit(X_train_important,y_train)
# Make predictions and determine the accuracy
predictions=rf_most_important.predict(X_test_important)
print('Performance Improvement:'+str(accuracy_score(y_test,predictions)*100))
print(classification_report(y_test, predictions))

Based on the findings about variable importance, I dropped some variables to try to improve performance, then re-ran the model and tested the results. The results are the same, but accuracy can sometimes be affected by dropping variables.
