
Fundamentals of Machine Learning (4341603) - Summer 2024 Solution

By Milav Dabgar

Question 1(a) [3 marks]

Define Machine Learning using a suitable example.

Answer:

Machine Learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every task.

Table: Key Components of Machine Learning

| Component  | Description                             |
|------------|-----------------------------------------|
| Data       | Input information used for training     |
| Algorithm  | Mathematical model that learns patterns |
| Training   | Process of teaching the algorithm       |
| Prediction | Output based on learned patterns        |

Example: Email spam detection system learns from thousands of emails labeled as “spam” or “not spam” to automatically classify new emails.
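
A minimal supervised-learning sketch of this example in Python (the toy emails, labels, and scikit-learn model choice are illustrative, not prescribed by the question):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labeled dataset (made up for illustration)
emails = ["win a free prize now", "meeting at 10 am",
          "free lottery winner", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()          # text -> word-count features
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)  # learn from labeled examples

print(model.predict(vectorizer.transform(["claim your free prize"])))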

Mnemonic: “Data Drives Decisions” - Data trains algorithms to make intelligent decisions

Question 1(b) [4 marks]

Explain the process of machine learning with the help of a schematic representation.

Answer:

The machine learning process involves systematic steps from data collection to model deployment.

flowchart TD
    A[Data Collection] --> B[Data Preprocessing]
    B --> C[Feature Selection]
    C --> D[Model Selection]
    D --> E[Training]
    E --> F[Validation]
    F --> G{Performance OK?}
    G -->|No| D
    G -->|Yes| H[Testing]
    H --> I[Deployment]

Process Steps:

  • Data Collection: Gathering relevant dataset
  • Preprocessing: Cleaning and preparing data
  • Training: Teaching algorithm using training data
  • Validation: Testing model performance
  • Deployment: Using model for real predictions
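
A minimal end-to-end sketch of these steps in scikit-learn (the iris dataset and logistic-regression model are stand-ins for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Data collection: a built-in dataset stands in for real gathering
X, y = load_iris(return_X_y=True)

# Preprocessing and splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)

# Training and validation
model = LogisticRegression(max_iter=200).fit(scaler.transform(X_train), y_train)
print(accuracy_score(y_test, model.predict(scaler.transform(X_test))))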

Mnemonic: “Computers Can Truly Think” - Collect, Clean, Train, Test

Question 1(c) [7 marks]

Explain different types of machine learning with suitable application.

Answer:

Machine learning algorithms are categorized based on learning approach and available data.

Table: Types of Machine Learning

| Type          | Learning Method        | Data Requirement     | Example Application   |
|---------------|------------------------|----------------------|-----------------------|
| Supervised    | Uses labeled data      | Input-output pairs   | Email classification  |
| Unsupervised  | Finds hidden patterns  | Only input data      | Customer segmentation |
| Reinforcement | Learns through rewards | Environment feedback | Game-playing AI       |

Applications:

  • Supervised Learning: Medical diagnosis, image recognition, fraud detection
  • Unsupervised Learning: Market research, anomaly detection, recommendation systems
  • Reinforcement Learning: Autonomous vehicles, robotics, strategic games

Diagram: Learning Types

mindmap
  root((Machine Learning))
    Supervised
      Classification
      Regression
    Unsupervised
      Clustering
      Association
    Reinforcement
      Policy Learning
      Value Function

Mnemonic: “Students Usually Remember” - Supervised, Unsupervised, Reinforcement

Question 1(c) OR [7 marks]

What are the various issues with machine learning? List three problems that should not be solved using machine learning.

Answer:

Table: Machine Learning Issues

| Issue Category   | Description                    | Impact                 |
|------------------|--------------------------------|------------------------|
| Data Quality     | Incomplete, noisy, biased data | Poor model performance |
| Overfitting      | Model memorizes training data  | Poor generalization    |
| Computational    | High processing requirements   | Resource constraints   |
| Interpretability | Black-box models               | Lack of transparency   |

Problems NOT suitable for ML:

  1. Simple rule-based tasks - Basic calculations, simple if-then logic
  2. Ethical decisions - Moral judgments requiring human values
  3. Creative expression - Original artistic creation requiring human emotion

Other Issues:

  • Privacy concerns: Sensitive data handling
  • Bias propagation: Unfair algorithmic decisions
  • Feature selection: Choosing relevant input variables

Mnemonic: “Data Drives Quality” - Data quality directly affects model quality

Question 2(a) [3 marks]

Give a summarized view of different types of data in a typical machine learning problem.

Answer:

Table: Data Types in Machine Learning

| Data Type   | Description         | Example                       |
|-------------|---------------------|-------------------------------|
| Numerical   | Quantitative values | Age: 25, Height: 170 cm       |
| Categorical | Discrete categories | Color: Red, Blue, Green       |
| Ordinal     | Ordered categories  | Rating: Poor, Good, Excellent |
| Binary      | Two possible values | Gender: Male/Female           |

Characteristics:

  • Structured: Organized in tables (databases, spreadsheets)
  • Unstructured: Images, text, audio files
  • Time-series: Data points over time

Mnemonic: “Numbers Count Better Than Words” - Numerical, Categorical, Binary, Text

Question 2(b) [4 marks]

Calculate the variance of both attributes. Determine which attribute is more spread out around its mean.

Answer:

Given Data:

  • Attribute 1: 32, 37, 47, 50, 59
  • Attribute 2: 48, 40, 41, 47, 49

Calculations:

Attribute 1:

  • Mean = (32+37+47+50+59)/5 = 225/5 = 45
  • Variance = [(32-45)² + (37-45)² + (47-45)² + (50-45)² + (59-45)²]/5
  • Variance = [169 + 64 + 4 + 25 + 196]/5 = 458/5 = 91.6

Attribute 2:

  • Mean = (48+40+41+47+49)/5 = 225/5 = 45
  • Variance = [(48-45)² + (40-45)² + (41-45)² + (47-45)² + (49-45)²]/5
  • Variance = [9 + 25 + 16 + 4 + 16]/5 = 70/5 = 14

Result: Attribute 1 (variance = 91.6) is more spread out than Attribute 2 (variance = 14).
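
The hand calculation can be verified with NumPy (np.var divides by N, the population formula used above):

import numpy as np

a1 = np.array([32, 37, 47, 50, 59])
a2 = np.array([48, 40, 41, 47, 49])
print(a1.var())  # 91.6
print(a2.var())  # 14.0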

Mnemonic: “Higher Variance Shows Spread” - Greater variance indicates more dispersion

Question 2(c) [7 marks]

List the factors that lead to data quality issues. How are outliers and missing values handled?

Answer:

Table: Data Quality Issues

| Factor         | Cause                   | Solution              |
|----------------|-------------------------|-----------------------|
| Incompleteness | Missing data collection | Imputation techniques |
| Inconsistency  | Different data formats  | Standardization       |
| Inaccuracy     | Human/sensor errors     | Validation rules      |
| Noise          | Random variations       | Filtering methods     |

Handling Outliers:

  • Detection: Statistical methods (Z-score, IQR)
  • Treatment: Remove, transform, or cap extreme values
  • Visualization: Box plots, scatter plots

Handling Missing Values:

  • Deletion: Remove incomplete records
  • Imputation: Fill with mean, median, or mode
  • Prediction: Use ML to predict missing values

Code Example:

# Handle missing values (df is an existing pandas DataFrame)
df = df.fillna(df.mean(numeric_only=True))  # Mean imputation for numeric columns
df = df.dropna()                            # Remove remaining incomplete rows
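
A sketch of IQR-based outlier capping in pandas ('value' is a placeholder column name, and df is assumed to be an existing DataFrame):

# IQR-based outlier detection and capping
q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df['value'] = df['value'].clip(lower, upper)  # cap extreme values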

Mnemonic: “Clean Data Makes Models” - Clean data produces better models

Question 2(a) OR [3 marks]

Give different machine learning activities.

Answer:

Table: Machine Learning Activities

| Activity            | Purpose                      | Example                             |
|---------------------|------------------------------|-------------------------------------|
| Data Collection     | Gather relevant information  | Surveys, sensors, databases         |
| Data Preprocessing  | Clean and prepare data       | Remove noise, handle missing values |
| Feature Engineering | Create meaningful variables  | Extract features from raw data      |
| Model Training      | Teach algorithm patterns     | Use training dataset                |
| Model Evaluation    | Assess performance           | Test accuracy, precision, recall    |
| Model Deployment    | Put model into production    | Web services, mobile apps           |

Key Activities:

  • Exploratory Data Analysis: Understanding data patterns
  • Hyperparameter Tuning: Optimizing model settings
  • Cross-validation: Robust performance assessment

Mnemonic: “Data Models Perform Excellently” - Data preparation, Model building, Performance evaluation, Execution

Question 2(b) OR [4 marks]

Calculate mean and median of the following numbers: 12,15,18,20,22,24,28,30

Answer:

Given numbers: 12, 15, 18, 20, 22, 24, 28, 30

Mean Calculation: Mean = (12+15+18+20+22+24+28+30)/8 = 169/8 = 21.125

Median Calculation:

  • Numbers are already sorted: 12, 15, 18, 20, 22, 24, 28, 30
  • Even count (8 numbers)
  • Median = (4th number + 5th number)/2 = (20 + 22)/2 = 21

Table: Statistical Summary

| Measure | Value  | Description   |
|---------|--------|---------------|
| Mean    | 21.125 | Average value |
| Median  | 21     | Middle value  |
| Count   | 8      | Total numbers |
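
Both values can be checked with Python's standard statistics module:

import statistics

data = [12, 15, 18, 20, 22, 24, 28, 30]
print(statistics.mean(data))    # 21.125
print(statistics.median(data))  # 21.0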

Mnemonic: “Middle Makes Median” - Middle value gives median

Question 2(c) OR [7 marks]

Write a short note on dimensionality reduction and feature subset selection in context with data preprocessing.

Answer:

Dimensionality Reduction removes irrelevant features and reduces computational complexity while preserving important information.

Table: Dimensionality Reduction Techniques

| Technique         | Method                       | Use Case             |
|-------------------|------------------------------|----------------------|
| PCA               | Principal Component Analysis | Linear reduction     |
| LDA               | Linear Discriminant Analysis | Classification tasks |
| t-SNE             | Non-linear embedding         | Visualization        |
| Feature Selection | Select important features    | Reduce overfitting   |

Feature Subset Selection Methods:

  • Filter Methods: Statistical tests, correlation analysis
  • Wrapper Methods: Forward/backward selection
  • Embedded Methods: LASSO, Ridge regression

Benefits:

  • Computational Efficiency: Faster training and prediction
  • Storage Reduction: Less memory requirements
  • Noise Reduction: Remove irrelevant features
  • Visualization: Enable 2D/3D plotting

Code Example:

from sklearn.decomposition import PCA
pca = PCA(n_components=2)               # keep the two strongest components
reduced_data = pca.fit_transform(data)  # project the data onto them

Mnemonic: “Reduce Features, Improve Performance” - Fewer features often lead to better models

Question 3(a) [3 marks]

Does bias affect the performance of the ML model? Explain briefly.

Answer:

Yes, bias significantly affects ML model performance by creating systematic errors in predictions.

Table: Types of Bias

| Bias Type         | Description               | Impact              |
|-------------------|---------------------------|---------------------|
| Selection Bias    | Non-representative data   | Poor generalization |
| Confirmation Bias | Favoring expected results | Skewed conclusions  |
| Algorithmic Bias  | Model assumptions         | Unfair predictions  |

Effects on Performance:

  • Underfitting: High bias leads to oversimplified models
  • Poor Accuracy: Systematic errors reduce overall performance
  • Unfair Decisions: Biased models discriminate against groups

Mitigation Strategies:

  • Diverse training data
  • Cross-validation techniques
  • Bias detection algorithms

Mnemonic: “Bias Breaks Better Performance” - Bias reduces model effectiveness

Question 3(b) [4 marks]

Compare cross-validation and bootstrap sampling

Answer:

Table: Cross-validation vs Bootstrap Sampling

| Aspect     | Cross-validation        | Bootstrap Sampling       |
|------------|-------------------------|--------------------------|
| Method     | Split data into folds   | Sample with replacement  |
| Data Usage | Uses all data           | Creates multiple samples |
| Purpose    | Model evaluation        | Estimate uncertainty     |
| Overlap    | No overlap between sets | Allows duplicate samples |

Cross-validation:

  • Divides data into k equal parts
  • Trains on k-1 parts, tests on 1 part
  • Repeats k times for robust evaluation

Bootstrap Sampling:

  • Creates random samples with replacement
  • Generates multiple datasets of same size
  • Estimates confidence intervals

Applications:

  • Cross-validation: Model selection, hyperparameter tuning
  • Bootstrap: Statistical inference, confidence estimation
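
A minimal contrast of the two techniques in Python (the iris dataset and decision-tree model are illustrative choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validation: disjoint folds, each point tested exactly once
print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())

# Bootstrap: resample with replacement, so duplicates are allowed
rng = np.random.default_rng(0)
idx = rng.integers(0, len(X), size=len(X))
X_boot, y_boot = X[idx], y[idx]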

Mnemonic: “Cross Checks, Bootstrap Builds” - Cross-validation checks performance, Bootstrap builds confidence

Question 3(c) [7 marks]

Confusion Matrix Calculation and Metrics

Answer:

Given Information:

  • True Positive (TP): 83 (predicted buy, actually bought)
  • False Positive (FP): 7 (predicted buy, didn’t buy)
  • False Negative (FN): 5 (predicted no buy, actually bought)
  • True Negative (TN): 5 (predicted no buy, didn’t buy)

Confusion Matrix:

|                 | Predicted Buy | Predicted No Buy |
|-----------------|---------------|------------------|
| Actually Buy    | 83 (TP)       | 5 (FN)           |
| Actually No Buy | 7 (FP)        | 5 (TN)           |

Calculations:

a) Error Rate: Error Rate = (FP + FN) / Total = (7 + 5) / 100 = 0.12 = 12%

b) Precision: Precision = TP / (TP + FP) = 83 / (83 + 7) = 83/90 = 0.922 = 92.2%

c) Recall: Recall = TP / (TP + FN) = 83 / (83 + 5) = 83/88 = 0.943 = 94.3%

d) F-measure: F-measure = 2 × (Precision × Recall) / (Precision + Recall)
F-measure = 2 × (0.922 × 0.943) / (0.922 + 0.943) = 0.932 = 93.2%

Table: Performance Metrics

| Metric     | Value | Interpretation                          |
|------------|-------|-----------------------------------------|
| Error Rate | 12%   | Model makes 12% wrong predictions       |
| Precision  | 92.2% | 92.2% of predicted buyers actually buy  |
| Recall     | 94.3% | Model identifies 94.3% of actual buyers |
| F-measure  | 93.2% | Balanced performance measure            |
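
These figures can be cross-checked with scikit-learn by rebuilding the labels from the counts (1 = buy, 0 = no buy):

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1] * 83 + [0] * 7 + [1] * 5 + [0] * 5  # actual outcomes
y_pred = [1] * 83 + [1] * 7 + [0] * 5 + [0] * 5  # model predictions

print(precision_score(y_true, y_pred))  # 0.922...
print(recall_score(y_true, y_pred))     # 0.943...
print(f1_score(y_true, y_pred))         # 0.932...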

Mnemonic: “Perfect Recall Finds Everyone” - Precision measures accuracy, Recall finds all positives

Question 3(a) OR [3 marks]

Define in brief: a) Target function b) Cost function c) Loss Function

Answer:

Table: Function Definitions

| Function        | Definition                              | Purpose                     |
|-----------------|-----------------------------------------|-----------------------------|
| Target Function | Ideal mapping from input to output      | What we want to learn       |
| Cost Function   | Measures overall model error            | Evaluate total performance  |
| Loss Function   | Measures error for a single prediction  | Individual prediction error |

Detailed Explanation:

  • Target Function: f(x) = y, the true relationship we want to approximate
  • Cost Function: Average of all loss functions, J = (1/n) Σ loss(yᵢ, ŷᵢ)
  • Loss Function: Error for one sample, e.g., (yᵢ - ŷᵢ)²

Relationship: Cost function is typically the average of loss functions across all training examples.
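
A tiny numeric sketch of this relationship (the sample values are made up for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])  # targets
y_pred = np.array([2.5, 5.0, 8.0])  # model outputs

loss = (y_true - y_pred) ** 2  # squared-error loss per sample
cost = loss.mean()             # cost J = (1/n) * sum of losses
print(loss, cost)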

Mnemonic: “Target Costs Less” - Target function is ideal, Cost function measures overall error, Loss function measures individual error

Question 3(b) OR [4 marks]

Explain balanced fit, underfit and overfit

Answer:

Table: Model Fitting Types

| Fit Type     | Training Error | Validation Error | Characteristics    |
|--------------|----------------|------------------|--------------------|
| Underfit     | High           | High             | Too simple model   |
| Balanced Fit | Low            | Low              | Optimal complexity |
| Overfit      | Very Low       | High             | Too complex model  |

Visualization:

graph LR
    A[Underfit] --> B[Balanced Fit]
    B --> C[Overfit]
    A --> D[High Bias]
    C --> E[High Variance]
    B --> F[Optimal Performance]

Characteristics:

  • Underfit: Model too simple, cannot capture patterns
  • Balanced Fit: Right complexity, generalizes well
  • Overfit: Model too complex, memorizes training data

Solutions:

  • Underfit: Increase model complexity, add features
  • Overfit: Regularization, cross-validation, more data

Mnemonic: “Balance Brings Best Results” - Balanced models perform best on new data

Question 4(a) [3 marks]

Give classification learning steps.

Answer:

Table: Classification Learning Steps

| Step              | Description                | Purpose                   |
|-------------------|----------------------------|---------------------------|
| Data Collection   | Gather labeled examples    | Provide training material |
| Preprocessing     | Clean and prepare data     | Improve data quality      |
| Feature Selection | Choose relevant attributes | Reduce complexity         |
| Model Training    | Learn from training data   | Build classifier          |
| Evaluation        | Test model performance     | Assess accuracy           |
| Deployment        | Use for new predictions    | Practical application     |

Detailed Process:

  1. Prepare dataset with input features and class labels
  2. Split data into training and testing sets
  3. Train classifier using training data
  4. Validate model using test data
  5. Fine-tune parameters for optimal performance

Mnemonic: “Data Preparation Facilitates Model Excellence” - Data prep, Feature selection, Model training, Evaluation

Question 4(b) [4 marks]

Linear Relationship Calculation

Answer:

Given Data:

| Hours (X) | Exam Score (Y) |
|-----------|----------------|
| 2         | 85             |
| 3         | 80             |
| 4         | 75             |
| 5         | 70             |
| 6         | 60             |

Linear Regression Calculation:

Step 1: Calculate means

  • X̄ = (2+3+4+5+6)/5 = 4
  • Ȳ = (85+80+75+70+60)/5 = 74

Step 2: Calculate slope (b)

  • Numerator = Σ(X-X̄)(Y-Ȳ) = (2-4)(85-74) + (3-4)(80-74) + (4-4)(75-74) + (5-4)(70-74) + (6-4)(60-74)
  • = (-2)(11) + (-1)(6) + (0)(1) + (1)(-4) + (2)(-14) = -22 - 6 + 0 - 4 - 28 = -60
  • Denominator = Σ(X-X̄)² = (-2)² + (-1)² + (0)² + (1)² + (2)² = 4 + 1 + 0 + 1 + 4 = 10
  • b = -60/10 = -6

Step 3: Calculate intercept (a)

  • a = Ȳ - b×X̄ = 74 - (-6)×4 = 74 + 24 = 98

Linear Equation: Y = 98 - 6X

Interpretation: For every additional hour of smartphone use, exam score decreases by 6 points.
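
The fitted line can be verified with NumPy's least-squares polynomial fit:

import numpy as np

x = np.array([2, 3, 4, 5, 6])
y = np.array([85, 80, 75, 70, 60])
slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
print(slope, intercept)                 # -6.0, 98.0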

Mnemonic: “More Phone, Less Score” - Negative correlation between phone use and grades

Question 4(c) [7 marks]

Explain classification steps in detail

Answer:

Classification is a supervised learning process that assigns input data to predefined categories or classes.

Detailed Classification Steps:

1. Problem Definition

  • Define classes and objectives
  • Identify input features and target variable
  • Determine success criteria

2. Data Collection and Preparation

flowchart TD
    A[Raw Data] --> B[Data Cleaning]
    B --> C[Handle Missing Values]
    C --> D[Remove Outliers]
    D --> E[Feature Engineering]
    E --> F[Data Splitting]

3. Feature Engineering

  • Feature Selection: Choose relevant attributes
  • Feature Extraction: Create new meaningful features
  • Normalization: Scale features to similar ranges

4. Model Selection and Training

Table: Common Classification Algorithms

| Algorithm       | Best For              | Advantages          |
|-----------------|-----------------------|---------------------|
| Decision Tree   | Interpretable rules   | Easy to understand  |
| SVM             | High-dimensional data | Good generalization |
| Neural Networks | Complex patterns      | High accuracy       |
| Naive Bayes     | Text classification   | Fast training       |

5. Model Evaluation

  • Confusion Matrix: Detailed performance analysis
  • Cross-validation: Robust performance estimation
  • Metrics: Accuracy, Precision, Recall, F1-score

6. Hyperparameter Tuning

  • Grid search for optimal parameters
  • Validation set for parameter selection

7. Final Evaluation and Deployment

  • Test on unseen data
  • Deploy model for production use
  • Monitor performance over time

Mnemonic: “Proper Data Modeling Evaluates Performance Thoroughly” - Problem definition, Data prep, Modeling, Evaluation, Performance testing, Tuning

Question 4(a) OR [3 marks]

Does the choice of the k value influence the performance of the KNN algorithm? Explain briefly

Answer:

Yes, the k value significantly influences KNN algorithm performance by affecting the decision boundary and model complexity.

Table: K Value Impact

| K Value       | Effect             | Performance             |
|---------------|--------------------|-------------------------|
| Small K (k=1) | Sensitive to noise | High variance, low bias |
| Medium K      | Balanced decisions | Optimal performance     |
| Large K       | Smooth boundaries  | Low variance, high bias |

Impact Analysis:

  • k=1: May overfit to training data, sensitive to outliers
  • Optimal k: Usually odd number, balances bias-variance tradeoff
  • Large k: May underfit, loses local patterns

Selection Strategy:

  • Use cross-validation to find optimal k
  • Try k = √n as starting point
  • Consider computational cost vs accuracy
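
A sketch of that cross-validation search over k (the iris dataset is an illustrative stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Compare odd k values by 5-fold cross-validated accuracy
for k in [1, 3, 5, 7, 9]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(k, round(score, 3))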

Mnemonic: “Small K Varies, Large K Smooths” - Small k creates variance, large k creates smooth boundaries

Question 4(b) OR [4 marks]

Define Support Vectors in the SVM model.

Answer:

Support Vectors are the critical data points that lie closest to the decision boundary (hyperplane) in Support Vector Machine algorithm.

Table: Support Vector Characteristics

| Aspect      | Description                     | Importance                   |
|-------------|---------------------------------|------------------------------|
| Location    | Closest points to hyperplane    | Define decision boundary     |
| Distance    | Equal distance from boundary    | Maximize margin              |
| Role        | Support the hyperplane          | Determine optimal separation |
| Sensitivity | Removing them changes the model | Critical for model structure |

Key Properties:

  • Margin Definition: Support vectors determine the maximum margin between classes
  • Model Dependency: Only support vectors affect the final model
  • Boundary Formation: Create the optimal separating hyperplane

Diagram: the separating hyperplane with two parallel margin boundaries; the support vectors of class A and class B are the points lying exactly on those boundaries, with all other points farther away.

Mathematical Significance: Support vectors satisfy the constraint yᵢ(w·xᵢ + b) = 1; they lie exactly on the margin boundary.
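
A short sketch that exposes the support vectors of a fitted linear SVM (the blob data is generated purely for illustration):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters of points
X, y = make_blobs(n_samples=40, centers=2, random_state=6)

model = SVC(kernel='linear', C=1000).fit(X, y)  # large C approximates a hard margin
print(model.support_vectors_)  # the points lying on the margin
print(model.n_support_)        # support-vector count per class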

Mnemonic: “Support Vectors Support Decisions” - These vectors support the decision boundary

Question 4(c) OR [7 marks]

Explain logistic regression in detail.

Answer:

Logistic Regression is a statistical method used for binary classification that models the probability of class membership using the logistic function.

Mathematical Foundation:

Sigmoid Function: σ(z) = 1 / (1 + e^(-z)) where z = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

Table: Linear vs Logistic Regression

| Aspect         | Linear Regression  | Logistic Regression |
|----------------|--------------------|---------------------|
| Output         | Continuous values  | Probabilities (0-1) |
| Function       | Linear             | Sigmoid (S-curve)   |
| Purpose        | Prediction         | Classification      |
| Error Function | Mean Squared Error | Log-likelihood      |

Key Components:

1. Logistic Function Properties:

  • S-shaped curve: Smooth transition between 0 and 1
  • Asymptotes: Approaches 0 and 1 but never reaches them
  • Monotonic: Always increasing function

2. Model Training:

  • Maximum Likelihood Estimation: Find parameters that maximize probability of observed data
  • Gradient Descent: Iterative optimization algorithm
  • Cost Function: Log-loss or cross-entropy

3. Decision Making:

  • Threshold: Typically 0.5 for binary classification
  • Probability Output: P(y=1|x) gives class probability
  • Decision Rule: Classify as positive if P(y=1|x) > 0.5

Advantages:

  • Probabilistic Output: Provides confidence in predictions
  • No Assumptions: About distribution of independent variables
  • Less Overfitting: Compared to complex models
  • Fast Training: Efficient computation

Applications:

  • Medical diagnosis
  • Marketing response prediction
  • Credit approval decisions
  • Email spam detection

Code Example:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

Mnemonic: “Sigmoid Squashes Infinite Input” - Sigmoid function converts any real number to probability

Question 5(a) [3 marks]

Write a short note on Matplotlib python library.

Answer:

Matplotlib is a comprehensive Python library for creating static, animated, and interactive visualizations in data science and machine learning.

Table: Matplotlib Key Features

| Feature          | Purpose                        | Example                   |
|------------------|--------------------------------|---------------------------|
| Pyplot           | MATLAB-like plotting interface | Line plots, scatter plots |
| Object-oriented  | Advanced customization         | Figure and axes objects   |
| Multiple formats | Save in various formats        | PNG, PDF, SVG, EPS        |
| Subplots         | Multiple plots in one figure   | Grid arrangements         |

Common Plot Types:

  • Line Plot: Trends over time
  • Scatter Plot: Relationship between variables
  • Histogram: Data distribution
  • Bar Chart: Categorical comparisons
  • Box Plot: Statistical summaries

Basic Usage:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]   # sample data for illustration
y = [1, 4, 9, 16]
plt.plot(x, y)
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.title('Plot Title')
plt.show()

Applications: Data exploration, model performance visualization, presentation graphics

Mnemonic: “Matplotlib Makes Pretty Plots” - Essential tool for data visualization

Question 5(b) [4 marks]
#

K-means clustering for two-dimensional data

Answer:

Given Points: {(2,3),(3,3),(4,3),(5,3),(6,3),(7,3),(8,3),(25,20),(26,20),(27,20),(28,20),(29,20),(30,20)}

K-means Algorithm Steps:

Step 1: Initialize centroids

  • Cluster 1: (4, 3) - chosen from left group
  • Cluster 2: (27, 20) - chosen from right group

Step 2: Assign points to nearest centroid

Table: Point Assignments

| Point   | Distance to C1 (4,3) | Distance to C2 (27,20) | Assigned Cluster |
|---------|----------------------|------------------------|------------------|
| (2,3)   | 2.0                  | 30.2                   | Cluster 1        |
| (3,3)   | 1.0                  | 29.4                   | Cluster 1        |
| (4,3)   | 0.0                  | 28.6                   | Cluster 1        |
| (5,3)   | 1.0                  | 27.8                   | Cluster 1        |
| (6,3)   | 2.0                  | 27.0                   | Cluster 1        |
| (7,3)   | 3.0                  | 26.2                   | Cluster 1        |
| (8,3)   | 4.0                  | 25.5                   | Cluster 1        |
| (25,20) | 27.0                 | 2.0                    | Cluster 2        |
| (26,20) | 27.8                 | 1.0                    | Cluster 2        |
| (27,20) | 28.6                 | 0.0                    | Cluster 2        |
| (28,20) | 29.4                 | 1.0                    | Cluster 2        |
| (29,20) | 30.2                 | 2.0                    | Cluster 2        |
| (30,20) | 31.1                 | 3.0                    | Cluster 2        |

(Distances are Euclidean, rounded to one decimal place.)

Step 3: Update centroids

  • New C1 = ((2+3+4+5+6+7+8)/7, (3+3+3+3+3+3+3)/7) = (5, 3)
  • New C2 = ((25+26+27+28+29+30)/6, (20+20+20+20+20+20)/6) = (27.5, 20)

Final Clusters:

  • Cluster 1: {(2,3),(3,3),(4,3),(5,3),(6,3),(7,3),(8,3)}
  • Cluster 2: {(25,20),(26,20),(27,20),(28,20),(29,20),(30,20)}
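
The same result can be reproduced with scikit-learn's KMeans (cluster numbering may differ between runs):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([(2,3),(3,3),(4,3),(5,3),(6,3),(7,3),(8,3),
                   (25,20),(26,20),(27,20),(28,20),(29,20),(30,20)])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment per point
print(kmeans.cluster_centers_)  # converges to (5, 3) and (27.5, 20)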

Mnemonic: “Centroids Attract Nearest Neighbors” - Points join closest centroid

Question 5(c) [7 marks]

Give functions and its use of Scikit-learn for: a. Data Preprocessing b. Model Selection c. Model Evaluation and Metrics

Answer:

Scikit-learn provides comprehensive tools for machine learning workflow from data preprocessing to model evaluation.

a) Data Preprocessing Functions:

Table: Preprocessing Functions

| Function           | Purpose                   | Example Usage               |
|--------------------|---------------------------|-----------------------------|
| StandardScaler()   | Normalize features        | Remove mean, unit variance  |
| MinMaxScaler()     | Scale to range [0,1]      | Feature scaling             |
| LabelEncoder()     | Encode categorical labels | Convert text to numbers     |
| OneHotEncoder()    | Create dummy variables    | Handle categorical features |
| train_test_split() | Split dataset             | Training/testing division   |

Code Example:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

b) Model Selection Functions:

Table: Model Selection Tools

| Function             | Purpose                            | Application                   |
|----------------------|------------------------------------|-------------------------------|
| GridSearchCV()       | Hyperparameter tuning              | Find optimal parameters       |
| RandomizedSearchCV() | Random parameter search            | Faster parameter optimization |
| cross_val_score()    | Cross-validation                   | Model performance evaluation  |
| StratifiedKFold()    | Stratified sampling                | Balanced cross-validation     |
| Pipeline()           | Combine preprocessing and modeling | Streamlined workflow          |

Code Example:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

c) Model Evaluation and Metrics Functions:

Table: Evaluation Metrics

| Function                | Purpose                           | Use Case                 |
|-------------------------|-----------------------------------|--------------------------|
| accuracy_score()        | Overall accuracy                  | General classification   |
| precision_score()       | Positive prediction accuracy      | Minimize false positives |
| recall_score()          | True positive rate                | Minimize false negatives |
| f1_score()              | Harmonic mean of precision/recall | Balanced metric          |
| confusion_matrix()      | Detailed error analysis           | Understanding mistakes   |
| classification_report() | Comprehensive metrics             | Complete evaluation      |
| roc_auc_score()         | Area under ROC curve              | Binary classification    |

Code Example:

from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))

Workflow Integration:

  • Preprocessing: Clean and prepare data
  • Model Selection: Choose and tune algorithms
  • Evaluation: Assess performance comprehensively

Mnemonic: “Preprocess, Select, Evaluate” - Complete ML workflow in Scikit-learn

Question 5(a) OR [3 marks]

List out the major features of Numpy.

Answer:

NumPy (Numerical Python) is the fundamental package for scientific computing in Python, providing powerful array operations and mathematical functions.

Table: Major NumPy Features

| Feature              | Description                          | Benefit                      |
|----------------------|--------------------------------------|------------------------------|
| N-dimensional Arrays | Efficient array objects              | Fast mathematical operations |
| Broadcasting         | Operations on different-sized arrays | Flexible computations        |
| Linear Algebra       | Matrix operations, decompositions    | Scientific computing         |
| Random Numbers       | Random sampling and distributions    | Statistical simulations      |
| Integration          | Works with C/C++/Fortran             | High performance             |

Key Capabilities:

  • Mathematical Functions: Trigonometric, logarithmic, exponential
  • Array Manipulation: Reshaping, splitting, joining arrays
  • Indexing: Advanced slicing and boolean indexing
  • Memory Efficiency: Optimized data storage
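
A few of these features in action (values chosen only to demonstrate the operations):

import numpy as np

a = np.arange(6).reshape(2, 3)                  # N-dimensional array
print(a + np.array([10, 20, 30]))               # broadcasting across rows
print(a @ a.T)                                  # linear algebra: matrix product
print(np.random.default_rng(0).normal(size=3))  # random sampling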

Applications: Data analysis, machine learning, image processing, scientific research

Mnemonic: “Numbers Need Numpy’s Power” - Essential for numerical computations

Question 5(b) OR [4 marks]
#

K-means clustering for one-dimensional data

Answer:

Given Dataset: {1,2,4,5,7,8,10,11,12,14,15,17}

K-means Algorithm for 3 clusters:

Step 1: Initialize centroids

  • C1 = 3 (around early values)
  • C2 = 9 (around middle values)
  • C3 = 15 (around later values)

Step 2: Assign points to nearest centroid

Table: Point Assignments (Iteration 1)

| Point | Distance to C1 | Distance to C2 | Distance to C3 | Assigned Cluster                      |
|-------|----------------|----------------|----------------|---------------------------------------|
| 1     | 2              | 8              | 14             | Cluster 1                             |
| 2     | 1              | 7              | 13             | Cluster 1                             |
| 4     | 1              | 5              | 11             | Cluster 1                             |
| 5     | 2              | 4              | 10             | Cluster 1                             |
| 7     | 4              | 2              | 8              | Cluster 2                             |
| 8     | 5              | 1              | 7              | Cluster 2                             |
| 10    | 7              | 1              | 5              | Cluster 2                             |
| 11    | 8              | 2              | 4              | Cluster 2                             |
| 12    | 9              | 3              | 3              | Cluster 2 (tie with C3, kept with C2) |
| 14    | 11             | 5              | 1              | Cluster 3                             |
| 15    | 12             | 6              | 0              | Cluster 3                             |
| 17    | 14             | 8              | 2              | Cluster 3                             |

Step 3: Update centroids

  • New C1 = (1+2+4+5)/4 = 3
  • New C2 = (7+8+10+11+12)/5 = 9.6
  • New C3 = (14+15+17)/3 = 15.33

Final Clusters:

  • Cluster 1: {1, 2, 4, 5}
  • Cluster 2: {7, 8, 10, 11, 12}
  • Cluster 3: {14, 15, 17}
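
A scikit-learn check of this clustering (with groups this well separated it typically converges to the same partition):

import numpy as np
from sklearn.cluster import KMeans

data = np.array([1, 2, 4, 5, 7, 8, 10, 11, 12, 14, 15, 17]).reshape(-1, 1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(kmeans.labels_)                   # cluster per point
print(kmeans.cluster_centers_.ravel())  # approx. 3, 9.6, 15.33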

Mnemonic: “Groups Gather by Distance” - Similar points form natural clusters

Question 5(c) OR [7 marks]

Give function and its use of Pandas library for: a. Data Preprocessing b. Data Inspection c. Data Cleaning and Transformation

Answer:

Pandas is a powerful Python library for data manipulation and analysis, providing high-level data structures and operations.

a) Data Preprocessing Functions:

Table: Preprocessing Functions

| Function    | Purpose             | Example                 |
|-------------|---------------------|-------------------------|
| read_csv()  | Load CSV files      | pd.read_csv('data.csv') |
| head()      | View first n rows   | df.head(10)             |
| tail()      | View last n rows    | df.tail(5)              |
| sample()    | Random sampling     | df.sample(100)          |
| set_index() | Set column as index | df.set_index('id')      |

b) Data Inspection Functions:

Table: Inspection Functions

| Function       | Purpose             | Information Provided     |
|----------------|---------------------|--------------------------|
| info()         | Dataset overview    | Data types, memory usage |
| describe()     | Statistical summary | Mean, std, min, max      |
| shape          | Dataset dimensions  | (rows, columns)          |
| dtypes         | Data types          | Column data types        |
| isnull()       | Missing values      | Boolean mask for nulls   |
| value_counts() | Count unique values | Frequency distribution   |
| corr()         | Correlation matrix  | Feature relationships    |

Code Example:

# Data inspection
print(df.info())
print(df.describe())
print(df.isnull().sum())

c) Data Cleaning and Transformation Functions:

Table: Cleaning Functions

| Function          | Purpose                | Usage                    |
|-------------------|------------------------|--------------------------|
| dropna()          | Remove missing values  | df.dropna()              |
| fillna()          | Fill missing values    | df.fillna(0)             |
| drop_duplicates() | Remove duplicate rows  | df.drop_duplicates()     |
| replace()         | Replace values         | df.replace('old', 'new') |
| astype()          | Change data types      | df['col'].astype('int')  |
| apply()           | Apply function to data | df.apply(lambda x: x*2)  |
| groupby()         | Group data             | df.groupby('category')   |
| merge()           | Join datasets          | pd.merge(df1, df2)       |
| pivot()           | Reshape data           | df.pivot(columns='col')  |

Advanced Operations:

  • String Operations: str.contains(), str.replace()
  • Date Operations: to_datetime(), dt.year
  • Categorical Data: pd.Categorical()

Workflow Example:

# Complete preprocessing pipeline
import pandas as pd

df = pd.read_csv('data.csv')
df = df.dropna()
df['category'] = df['category'].astype('category')
df_grouped = df.groupby('type').mean(numeric_only=True)

Benefits:

  • Intuitive Syntax: Easy to learn and use
  • Performance: Optimized for large datasets
  • Integration: Works well with NumPy, Matplotlib
  • Flexibility: Handles various data formats

Mnemonic: “Pandas Processes Data Perfectly” - Comprehensive data manipulation tool
