Code
#### Machine Learning Lab Program 1
Aim: Predict salary from years of experience using simple linear regression.
import numpy as np
import pandas as pd
df = pd.read_csv('Salary.csv')
print(df.to_string())
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
df.head()
X = df.iloc[:, :-1].values # Features => Years of experience => Independent Variable
y = df.iloc[:, -1].values # Target => Salary => Dependent Variable
X
y
# Divide the dataset into training and testing subsets
from sklearn.model_selection import train_test_split
import sklearn.metrics as sm
# random_state => seed value used by the random number generator; set it for a reproducible split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
predictions
y_test
import seaborn as sns
sns.histplot(predictions - y_test, kde=True)  # distplot is deprecated in recent seaborn
plt.scatter(X_train, y_train, color='red')
plt.plot(X_train, model.predict(X_train))
r_sq = model.score(X_train, y_train)
print('coefficient of determination:', r_sq)
# Print the Intercept:
print('intercept:', model.intercept_)
# Print the Slope:
print('slope:', model.coef_)
# Predict a Response and print it:
y_pred = model.predict(X_train)
print('Predicted response:', y_pred)
print('y = ' + str(model.coef_[0]) + ' * X + ' + str(model.intercept_))
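The sklearn.metrics module was imported above as sm but never used; here is a minimal sketch of test-set evaluation with it (assuming the model, X_test, y_test, and predictions from above):
# Error metrics computed on the held-out test set, not the training data
print('Mean absolute error:', sm.mean_absolute_error(y_test, predictions))
print('Mean squared error:', sm.mean_squared_error(y_test, predictions))
print('R2 score:', sm.r2_score(y_test, predictions))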
#### Machine Learning Lab Program 2
Aim: Cluster wines into groups using the K-Means algorithm.
# Import required libraries and read data into a dataframe
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('wine-clustering.csv')
df.head()
# Visual data exploration to identify correlation among columns of data
import seaborn as sns
sns.pairplot(df)
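Alongside the pairplot, a numeric view of the same pairwise correlations can help; a small sketch using the same dataframe (matplotlib is assumed imported as plt above):
# Heatmap of pairwise column correlations
sns.heatmap(df.corr(), cmap='coolwarm')
plt.show()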
#Import algorithms from sklearn
from sklearn.cluster import KMeans
# Use only two columns for clustering: the OD280 and Alcohol content of the wines
selected_features = df[['OD280', 'Alcohol']]
# The random_state needs to be the same number to get reproducible results; n_init is set explicitly for consistent behavior across scikit-learn versions
kmeans_obj = KMeans(n_clusters=3, n_init=10, random_state=42)
# Fit the Kmeans algorithm on selected columns
kmeans_obj.fit(selected_features)
# Predict the cluster labels for data
y_kmeans = kmeans_obj.predict(selected_features)
#Print the predicted labels
print(y_kmeans)
# Printing the cluster centers
centers = kmeans_obj.cluster_centers_
print(centers)
#Visualize the Groups created
sns.scatterplot(x = selected_features['OD280'], y = selected_features['Alcohol'], hue=kmeans_obj.labels_)
#Visualize the cluster centroids
plt.scatter(kmeans_obj.cluster_centers_[:, 0], kmeans_obj.cluster_centers_[:, 1], s=200, c='red')
plt.show()
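The choice of n_clusters=3 above is an assumption about this data; a short elbow-method sketch for checking it on the same two columns:
# Elbow method: plot inertia (within-cluster sum of squares) for several k values
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(selected_features)
    inertias.append(km.inertia_)
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()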
#### Machine Learning Lab Program 3
Aim: Write a Python program to classify a medical dataset using the K-Nearest Neighbor algorithm. Demonstrate basic data processing operations, split the dataset into training and test sets, train the model, score the test dataset, and evaluate the predictions.
Description: In this program, you will use the Breast Cancer Wisconsin dataset (originally from the UCI Machine Learning Repository) to train a K-nearest neighbor model. The model classifies the test data into one of two classes, i.e. it predicts the diagnosis: B = benign, M = malignant.
Dataset: The Breast Cancer Wisconsin dataset from the UCI Machine Learning Repository is a classification dataset. It contains a total of 32 columns: the first column is the patient id, the second is the diagnosis ("B" for benign, "M" for malignant), and the remaining 30 columns are features, i.e. measurements for breast cancer patients. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image.
Dataset Description:
https://data.world/health/breast-cancer-wisconsin/workspace/file?filename=DatasetDescription.txt
Dataset url:
https://data.world/health/breast-cancer-wisconsin/workspace/file?filename=breast-cancer-wisconsin-data%2Fdata.csv
# import required libraries
import numpy as np
import pandas as pd
#if you have downloaded the dataset to your local computer you can use this syntax to read the file into your pandas dataframe
data = pd.read_csv("breast-cancer-wisconsin-data_data.csv")
data.head()
data.columns
data = data.drop(['id', 'Unnamed: 32'], axis = 1)
data.shape
data.describe()
data.info()
data.columns
# Extract the columns that will be the features and the target variable (diagnosis) in X and y respectively
X = data.loc[:, ['radius_mean', 'texture_mean', 'perimeter_mean',
'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst']]
y = data.loc[:, 'diagnosis']
X.head()
y.head()
#Train - Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fit the KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
knn_cfr = KNeighborsClassifier(n_neighbors=3)
knn_cfr.fit(X_train, y_train)
# Use the fitted model to make prediction for test data
y_pred = knn_cfr.predict(X_test)
# Print the accuracy score of your model
# accuracy = Number of correct predictions / total number of predictions
# accuracy_score = the number of test samples for which y_pred == y_test
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
# This model has an accuracy of 94.14 %
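Accuracy alone can hide per-class errors; a minimal sketch of a fuller evaluation (assuming y_test and y_pred from above):
# Confusion matrix and per-class precision/recall for the two diagnosis classes
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))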
#### Machine Learning Lab Program 4
Aim: Predict the real estate sales price of a house based upon various quantitative features about the house and sale.
Implementation: Demonstrate basic data processing operations, split the dataset into training and test sets, train the model, score the test dataset, and evaluate the predictions.
DataSet: Dataset containing house features and prices
Dataset url: https://data.world/swarnapuri-sude/house-data/workspace/file?filename=kc_house_data.csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("kc_house_data.csv")
data.head()
data = data.drop(["id", "date"], axis = 1)
data.head()
data.describe()
data['bedrooms'].value_counts().plot(kind='bar')
plt.title('Number of Bedrooms')
plt.xlabel('Bedrooms')
plt.ylabel('Count')
sns.despine()
plt.figure(figsize=(10,10))
sns.jointplot(x=data.lat.values, y=data.long.values, height=10)
plt.ylabel('Longitude',fontsize=12)
plt.xlabel('Latitude',fontsize=12)
sns.despine()
plt.show()
#Visualize how common factors affect the price of the houses
plt.scatter(data.price, data.sqft_living)
plt.title("Price vs Square Feet")
plt.show()
plt.scatter(data.price, data.long)
plt.title("Price vs Location of the area")
plt.show()
plt.scatter(data.price, data.lat)
plt.xlabel("Price")
plt.ylabel("Latitude")
plt.title("Latitude vs Price")
plt.show()
plt.scatter(data.bedrooms,data.price)
plt.title("Bedroom and Price ")
plt.xlabel("Bedrooms")
plt.ylabel("Price")
sns.despine()
plt.show()
plt.scatter(data['sqft_living'] + data['sqft_basement'], data['price'])
plt.title("Total living + basement area vs Price")
plt.show()
plt.scatter(data.waterfront, data.price)
plt.title("Waterfront vs Price (0 = no waterfront)")
plt.show()
#Extracting X features and y label
y = data['price']
X = data.drop(['price'],axis=1)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=2)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)
# R^2 on the test set
print(reg.score(x_test, y_test))
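reg.score returns R^2 on the test set; a small sketch that also reports the error in price units (RMSE), assuming the fitted reg and the split above:
# Root mean squared error of the test-set predictions
from sklearn.metrics import mean_squared_error
y_pred = reg.predict(x_test)
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))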
#### Machine Learning Lab Program 5
Aim: Write a Python program to predict the income levels of adult individuals using a Decision Tree model. The process includes training, testing, and evaluating the model on the Adult dataset.
In this experiment you train a classifier on the "adult" dataset and predict whether an individual's income is greater or less than $50,000. Perform basic data processing operations, split the dataset into training and test sets, train the model, score the test dataset, and evaluate the predictions.
Dataset: The Adult dataset comes from the Census Bureau; the task is to predict whether a given adult earns more than $50,000 a year based on attributes such as education and hours of work per week.
URL: https://www.kaggle.com/datasets/wenruliu/adult-income-dataset/download?datasetVersionNumber=2
It has a total of 15 columns.
Target column: "income". The income is divided into two classes: <=50K and >50K.
Number of attributes: 14. These are the demographics and other features that describe a person.
The 14 attributes are:
- Age.
- Workclass.
- Final Weight.
- Education.
- Education Number of Years.
- Marital-status.
- Occupation.
- Relationship.
- Race.
- Gender.
- Capital-gain.
- Capital-loss.
- Hours-per-week.
- Native-country.
The dataset contains missing values that are marked with a question mark character (?).
There are a total of 48,842 rows of data, and 3,620 with missing values, leaving 45,222 complete rows.
There are two class values '>50K' and '<=50K', i.e., it is a binary classification task.
#Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Read dataset
df = pd.read_csv("adult.csv")
df.head()
df.columns
df.shape
# See the columns that contain a "?" and how many "?" are there in those columns
df.isin(['?']).sum()
#Replace ? with NaN
df['workclass'] = df['workclass'].replace('?', np.nan)
df['occupation'] = df['occupation'].replace('?', np.nan)
df['native-country'] = df['native-country'].replace('?', np.nan)
#Now the ? has been replaced by NaN, so count of ? is 0
df.isin(['?']).sum()
#Check missing values - NaN values
df.isnull().sum()
#Drop all rows that contain a missing value
df.dropna(how='any', inplace=True)
#Check duplicate values in dataframe now
print(f"There are {df.duplicated().sum()} duplicate values")
df = df.drop_duplicates()
df.shape
df.columns
#Drop non-relevant columns
df.drop(['fnlwgt', 'educational-num', 'marital-status', 'relationship', 'race'], axis=1, inplace=True)
df.columns
#Extract X and y from the dataframe , income column is the target column, rest columns are features
X = df.loc[:,['age', 'workclass', 'education', 'occupation', 'gender', 'capital-gain',
'capital-loss', 'hours-per-week', 'native-country']]
y = df.loc[:,'income']
X.head()
y.head()
# Since y is a binary categorical column we use a label encoder to convert it into a numerical column with values 0 and 1
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)
y = pd.DataFrame(y)
y.head()
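The encoder above is not kept in a variable, so the class-to-integer mapping stays implicit; a small sketch that recovers it explicitly (le is a hypothetical name, fit on the same income column):
# Keep the encoder object to inspect which class became 0 and which became 1
le = LabelEncoder()
le.fit(df.loc[:, 'income'])
print(dict(zip(le.classes_, le.transform(le.classes_))))  # expected: {'<=50K': 0, '>50K': 1}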
#First identify categorical features and numeric features
numeric_features = X.select_dtypes('number')
categorical_features = X.select_dtypes('object')
categorical_features
numeric_features
#Convert categorical features into numeric
converted_categorical_features = pd.get_dummies(categorical_features)
converted_categorical_features.shape
#combine the converted categorical features and the numeric features together into a new dataframe called "newX"
all_features = [converted_categorical_features, numeric_features]
newX = pd.concat(all_features,axis=1, join='inner')
newX.shape
newX.columns
#Do a train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.33, random_state=42)
# Create a Decision Tree classifier with max_depth=5 and fit it on X_train and y_train
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
y_test.shape
y_pred.shape
predictions_df = pd.DataFrame()
predictions_df['predicted_salary_class'] = y_pred
predictions_df['actual_salary_class'] = y_test[0].values
predictions_df
#Evaluate the performance of fitting
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
#Plot your decision tree
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(14,14))
plot_tree(clf, fontsize=10, filled=True)
plt.title("Decision tree trained on the selected features")
plt.show()
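A short sketch for checking which features the fitted tree actually relies on (assuming clf and newX from above):
# Rank features by the impurity-based importance the tree assigns them
importances = pd.Series(clf.feature_importances_, index=newX.columns)
print(importances.sort_values(ascending=False).head(10))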