MATH 494: Mathematical Foundations of Machine Learning

 

Assignment #4

 

(Due Monday, March 4, 2024)

 

 

NOTE: The description of the script to write is longer than the script itself 😊

 

In this assignment, you will create a small synthetic dataset, that is, a set of data produced by an algorithm that uses pseudo-random number generators to imitate real-life datasets. Synthetic datasets are widely used to validate and test ML methods and models. The assignment is also meant to familiarize you with some practical aspects of working with a dataset and with basic statistical tools.

 

Since I want each student to produce a different dataset, you will again be working with digits from your USD ID#. Let A be the last digit of your USD ID# and let B be the next-to-last digit. Note: If either of these digits is 0, make it 9.
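
As a small illustration of the digit rule, here is a sketch in Python. The function name id_digits and the sample ID string are hypothetical and not part of the assignment; the only rule taken from the text is that a digit equal to 0 becomes 9.

```python
def id_digits(usd_id):
    """Return (A, B): the last and next-to-last digits of an ID string.

    A digit equal to 0 is replaced by 9, as the assignment requires.
    """
    # Keep only the digit characters, in case the ID contains dashes or spaces.
    digits = [int(ch) for ch in str(usd_id) if ch.isdigit()]
    if len(digits) < 2:
        raise ValueError("the ID must contain at least two digits")

    a, b = digits[-1], digits[-2]
    a = 9 if a == 0 else a  # last digit: 0 becomes 9
    b = 9 if b == 0 else b  # next-to-last digit: 0 becomes 9
    return a, b


# Hypothetical ID used only for illustration; substitute your own.
A, B = id_digits("123456780")
print(A, B)
```

With the made-up ID above, the last digit is 0 and so becomes 9, while the next-to-last digit 8 is kept.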

 

Your dataset should have 100 rows (observations) and 5 columns (features, dimensions). The first column should contain numbers drawn from the uniform distribution on the range (A, 2A), and the second column numbers drawn from the uniform distribution on the range (B, 2B).
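
One possible way to draw the two uniform columns is sketched below with NumPy. The values of A and B, the seed, and the variable names are placeholders chosen for illustration; the assignment does not prescribe a language or library.

```python
import numpy as np

# Placeholder digits; use the ones derived from your own ID.
A, B = 7, 3
n_rows = 100

# A seeded generator makes the run reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=494)

# Generator.uniform samples from the half-open interval [low, high).
# Hitting the endpoint exactly has negligible probability, so this is
# treated here as equivalent to the open range (A, 2A).
x1 = rng.uniform(A, 2 * A, size=n_rows)  # first column
x2 = rng.uniform(B, 2 * B, size=n_rows)  # second column

print(x1.min(), x1.max())
print(x2.min(), x2.max())
```

Both calls produce all 100 values at once, so no loop is needed.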

 

The next three features will be linear combinations of the first two, with added Gaussian (normal) noise. For each observation, the third column should be the sum of the values in the first two columns, the fourth column should be the first plus twice the second, and the fifth column should be twice the first plus the second. Finally, add a Gaussian noise term (error term) to each value in the third, fourth, and fifth columns. Each noise term should have a mean of 0; its standard deviation should be 0.1 of the column mean for column 3, 0.5 of the column mean for column 4, and exactly the column mean for column 5.
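
The construction described above might be sketched as follows, again with NumPy and without loops. Several things here are assumptions rather than requirements: the placeholder values of A and B, the seed, the variable names, and in particular the reading of "column mean" as the mean of each column before noise is added.

```python
import numpy as np

A, B = 7, 3          # placeholder digits; use your own
n_rows = 100
rng = np.random.default_rng(seed=494)  # arbitrary seed for reproducibility

# Columns 1 and 2: uniform draws.
x1 = rng.uniform(A, 2 * A, size=n_rows)
x2 = rng.uniform(B, 2 * B, size=n_rows)

# Noise-free versions of columns 3, 4, 5, stacked into shape (100, 3).
signal = np.column_stack([x1 + x2, x1 + 2 * x2, 2 * x1 + x2])

# Standard deviation of the noise as a fraction of each column's mean.
# Assumption: the mean is taken over the noise-free column.
fractions = np.array([0.1, 0.5, 1.0])
sigma = fractions * signal.mean(axis=0)   # shape (3,)

# The scale vector of shape (3,) broadcasts across the 100 rows,
# so each column gets its own standard deviation.
noise = rng.normal(loc=0.0, scale=sigma, size=signal.shape)

# Final dataset: 100 rows, 5 columns.
data = np.column_stack([x1, x2, signal + noise])
print(data.shape)
```

Passing the three standard deviations as one array lets a single call generate the noise for all three columns, which is one way to keep the script free of explicit loops.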

 

When the dataset is generated, do the following:

 

 If you want some coding fun: don’t use any loops in the script; use vectorization instead…