General Instruction
This assignment is due by March 29, 2024. Late submission is allowed, but penalties will apply (see Syllabus).
Plagiarism is prohibited. If detected, it results in zero points for this assignment and immediate reporting to the Office of Student Accountability (OSA).
This is a coding assignment. Students are strongly encouraged to submit a Jupyter notebook. Other forms of source code are acceptable, but it is the student's responsibility to provide clear instructions on how to compile or execute the program. If your implementation has multiple files, please submit one .zip file that contains all of them. You can also submit the URL if your code is hosted online.
Generative AI tools (e.g., ChatGPT) can be used for assistance in this submission.
You only need to choose and work on one task for this assignment.
Task 1. Implement MLP with PyTorch
Objective
To acquire a better understanding of neural networks by using an open-source software package called PyTorch.
Task Overview
This assignment consists of the following tasks:
To install and learn to use PyTorch
To train multilayer perceptron (MLP) neural network models using all data sets provided, including five binary data sets and one multi-class data set.
To write up a report
Please read the following elaboration of each subtask.
Subtask 1. PyTorch Installation
Detailed guides on installing PyTorch for different computing platforms can be found at https://pytorch.org/get-started/locally/. Installing PyTorch using Anaconda or pip is the recommended way since it is simple and straightforward.
For this assignment, it suffices to use the CPU-only version of PyTorch (version 1.0 or higher) with Python 3.x. You may also use PyTorch with GPU support if you have a capable GPU.
You may also need to install matplotlib for plotting figures. However, other high-level machine learning frameworks such as Keras should not be used for this assignment.
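As a quick sanity check after installation, the following snippet prints the installed PyTorch version and whether a GPU is visible:

    import torch

    print(torch.__version__)          # should show a version of 1.0 or higher
    print(torch.cuda.is_available())  # True only if GPU support is installed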
Data Sets to Use
Five Binary Classification Data Sets.
You will use five binary classification data sets that are available in this ZIP file (Task 1 Dataset-1.zip). The following table shows the number of features, the number of training examples, and the number of test examples for each data set.
Dataset # of Features # of Train # of Test
Breast cancer 10 547 136
Diabetes 8 615 153
Digit 64 800 200
Iris 4 120 30
Wine 13 142 36
When you load each .npz data file, you will find four NumPy arrays: train_X, train_Y, test_X, and test_Y. Each row of X stores the features of one example, and the corresponding row of Y stores its class label (0 or 1). As always, the test-set class labels must not be used for classifier training; they are only for measuring classification accuracy on the test data.
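For example, each data set can be loaded as sketched below; the file name breast_cancer.npz is illustrative, so substitute the actual names from the ZIP and confirm the array names via data.files:

    import numpy as np

    data = np.load("breast_cancer.npz")  # illustrative file name
    print(data.files)                    # confirm the actual array names
    train_X, train_Y = data["train_X"], data["train_Y"]
    test_X, test_Y = data["test_X"], data["test_Y"]
    print(train_X.shape, test_X.shape)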
Multi-class Data Set.
You can also find this dataset in the same ZIP file (as above). The dataset consists of a training set of 10,000 examples and a test set of 1,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes. Please load the data set with SciPy, e.g., from scipy.io import loadmat.
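A minimal loading sketch; the file name below is illustrative, so print the dictionary keys to see what the .mat file actually contains:

    from scipy.io import loadmat

    mat = loadmat("multiclass.mat")  # illustrative file name
    print(mat.keys())                # confirm the actual variable names
    # Each 28x28 image can be flattened into a 784-dimensional vector
    # before being fed to an MLP.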
Subtask 2. Train and Test the Neural Network Models
As discussed in class, neural network classifiers generalize logistic regression by introducing one or more hidden layers. Both models can be learned with a (batch or stochastic) gradient descent algorithm that minimizes the cross-entropy loss. This requires that the step size parameter be specified. Try out a few values and choose one that leads to stable convergence. You may also decrease the step size gradually during the learning process to aid convergence. A common criterion for early stopping is when the improvement between iterations does not exceed a small threshold or when the number of iterations has reached a prespecified maximum. Since the solution found may depend on the randomly chosen initial weights, you may repeat each setting multiple times and report the average classification accuracy.
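One possible realization of this procedure is sketched below (all hyperparameter defaults are placeholders to be tuned, and the function name train_mlp is illustrative): a single-hidden-layer MLP trained with full-batch gradient descent on the cross-entropy loss, stopping early once the per-iteration improvement falls below a threshold.

    import torch
    import torch.nn as nn

    def train_mlp(train_X, train_Y, H, lr=0.1, max_iters=2000, tol=1e-5):
        """Full-batch gradient descent on the cross-entropy loss (sketch)."""
        X = torch.as_tensor(train_X, dtype=torch.float32)
        y = torch.as_tensor(train_Y, dtype=torch.long).flatten()
        model = nn.Sequential(
            nn.Linear(X.shape[1], H),
            nn.Sigmoid(),            # any standard activation works here
            nn.Linear(H, 2),         # two output logits for a binary problem
        )
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        prev_loss = float("inf")
        for _ in range(max_iters):
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            optimizer.step()
            if prev_loss - loss.item() < tol:  # stop on small improvement
                break
            prev_loss = loss.item()
        return model

Mini-batch (stochastic) training and a gradually decreasing step size can be obtained with torch.utils.data.DataLoader and a torch.optim.lr_scheduler, respectively.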
Neural Network Models for the Binary Classification Data Sets
For the 5 binary classification datasets, you are required to construct a set of single-hidden-layer neural network models. The number of hidden units H should be determined using cross-validation: the generalization performance of the model is estimated for each candidate value of H ∈ {1, 2, ..., 10}. This is done by randomly sampling 80% of the training instances to train a classifier and then testing it on the remaining 20%. Hence, given the 5 datasets, you are required to perform 5 such random data splits to find the most suitable H for each data set, respectively.
Subsequently, for each binary data set, you are required to train a neural network classifier with the most suitable number H of hidden units in a single layer, trained from scratch on all the available training instances.
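One way to organize this selection procedure is sketched below; train_mlp is the illustrative trainer from the earlier snippet, and accuracy is a small helper:

    import numpy as np
    import torch

    def accuracy(model, X, Y):
        with torch.no_grad():
            logits = model(torch.as_tensor(X, dtype=torch.float32))
        preds = logits.argmax(dim=1).numpy()
        return (preds == np.asarray(Y).flatten()).mean()

    def select_H(train_X, train_Y, candidates=range(1, 11), seed=0):
        """Pick H by one random 80/20 split of the training data (sketch)."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(train_X))
        split = int(0.8 * len(idx))
        tr, va = idx[:split], idx[split:]
        scores = {H: accuracy(train_mlp(train_X[tr], train_Y[tr], H),
                              train_X[va], train_Y[va])
                  for H in candidates}
        return max(scores, key=scores.get), scores

    # best_H, _ = select_H(train_X, train_Y)
    # final_model = train_mlp(train_X, train_Y, best_H)  # retrain on all data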
Neural Network Models for the Multi-class Data Set
For the multi-class data set, you are required to construct a set of neural network models with two hidden layers. The number of hidden units for the first hidden layer, L1, is chosen from {50, 75, 100}, while the number for the second hidden layer, L2, is chosen from {10, 15, 20}. As with the single-hidden-layer models above, the best combination of hidden-unit counts is decided by cross-validation (randomly sampling 80% of the training instances to train a classifier and then testing it on the remaining 20%).
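A sketch of the corresponding architecture and search grid; the ReLU activation and the helper name are illustrative choices, not requirements:

    import itertools
    import torch.nn as nn

    def make_two_layer_mlp(n_features, L1, L2, n_classes=10):
        """Two-hidden-layer MLP for the multi-class data set (sketch)."""
        return nn.Sequential(
            nn.Linear(n_features, L1), nn.ReLU(),
            nn.Linear(L1, L2), nn.ReLU(),
            nn.Linear(L2, n_classes),
        )

    # The (L1, L2) grid specified above; evaluate each pair with the same
    # 80/20 split strategy used for the binary data sets.
    grid = list(itertools.product([50, 75, 100], [10, 15, 20]))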
Subtask 3. Model Evaluation
You are expected to present a brief summary of the parameter settings and the experiment results (either via the print() function or documented in comments). Besides reporting the classification accuracy (for both training and test data) in numbers, graphical aids should also be used to compare the performance of different settings visually. Utilities in scikit-learn, such as the AUC score and the confusion matrix, are recommended for analyzing and reporting the experiment results. CPU time may simply be reported in numbers.
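For instance, the confusion matrix and AUC can be produced with scikit-learn as sketched below, where y_true, y_pred, and y_score stand for your test labels, predicted labels, and positive-class scores:

    import time
    import matplotlib.pyplot as plt
    from sklearn.metrics import (ConfusionMatrixDisplay, confusion_matrix,
                                 roc_auc_score)

    def report(y_true, y_pred, y_score=None):
        """Plot the confusion matrix and, for binary data sets, print the AUC."""
        ConfusionMatrixDisplay(confusion_matrix(y_true, y_pred)).plot()
        plt.show()
        if y_score is not None:
            print("AUC:", roc_auc_score(y_true, y_score))

    # CPU time can be measured around the training call, e.g.:
    # start = time.process_time(); model = train_mlp(...)
    # print("CPU time (s):", time.process_time() - start)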
Assessment Points
Build neural network models and adopt the gradient descent optimization algorithm.
Tune the parameters using cross-validation techniques.
Compute the cross-entropy loss and the accuracy of the neural network model on both the training and test sets.
Present the experiment settings of the neural network models such as optimizer and learning rate.
Report the parameter tuning result of the neural network model using cross-validation.
Report the accuracy of the selected neural network models on the test set.
Task 2. Comparison of Classifiers
Task Description
You are required to implement the following classifiers and compare the performance achieved by different classifiers.
Decision Tree: You should construct decision trees on the dataset using the entropy and Gini criteria, respectively. For each criterion, you should set the maximum depth to each value in {5, 10, 15, 20, 25} separately (see the sketch after this list). You need to compare the performance (accuracy, precision, recall, F1 score, and training time) and give a brief discussion.
KNN, Random Forest: Apply two additional classifiers, KNN and Random Forest, to the dataset.
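A sketch of how these models can be instantiated with scikit-learn; the KNN and Random Forest parameter values are illustrative, since the note below leaves them to you:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier

    # One decision tree per (criterion, depth) combination required above.
    trees = {
        (crit, depth): DecisionTreeClassifier(criterion=crit, max_depth=depth)
        for crit in ["entropy", "gini"]
        for depth in [5, 10, 15, 20, 25]
    }
    knn = KNeighborsClassifier(n_neighbors=5)      # illustrative parameter
    rf = RandomForestClassifier(n_estimators=100)  # illustrative parameter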
Model Evaluation
For each classifier, evaluate the performance (accuracy, precision, recall, F1 score, and training time). You are required to compare the performance of different classifiers and give a brief discussion.
You can summarize your results with a table (as shown below, or a similar one) followed by a brief discussion, e.g., specify how parameters were selected and whether any trend was observed during parameter tuning.
Classifier Accuracy Precision Recall F1 Training Time
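One way to fill such a table is sketched below; macro averaging over the 26 classes is one reasonable choice for precision, recall, and F1, and clf is any scikit-learn classifier:

    import time
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    def evaluate(clf, train_X, train_Y, test_X, test_Y):
        """Fit one classifier and return its row of metrics for the table."""
        start = time.process_time()
        clf.fit(train_X, train_Y)
        train_time = time.process_time() - start
        pred = clf.predict(test_X)
        acc = accuracy_score(test_Y, pred)
        prec, rec, f1, _ = precision_recall_fscore_support(
            test_Y, pred, average="macro")
        return acc, prec, rec, f1, train_time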
Data Description
We use the letter dataset from Statlog (Task 2 Dataset-1.zip). The statistics of the dataset are shown in the following table. The dataset has 26 classes.
Split Size # of Features
Training 15,000 16
Testing 5,000 16
Note
In this question, you are FREE to use different Python libraries for your implementation. The problem in this assignment is multi-class classification. All the metrics (accuracy, precision, recall, and F1 score) should be the average results over the entire test set, and training time should be reported alongside them. For classifiers whose parameters are not specified above (like KNN and Random Forest), you are free to choose the parameters yourself.
Assessment Points
Construct decision trees based on different criteria and parameter settings
Use the KNN and Random Forest classifiers for the classification
Clear presentation of the metrics for different classification methods with brief discussions
A brief comparison of different models and a conclusion
Code is properly commented and easy to follow