
MACHINE LEARNING ADVANCED CERTIFICATION | JAN 9 2021 | Wajahat

_83549

New Member
Hi Wajahat,
Please find below the assignment pdf.
 

Attachments

  • LessonEndProject.pdf
    20.3 KB · Views: 2
  • LessonEndProject.pdf
    20.3 KB · Views: 0

_86766

Member
Hi, please find the attachments.

Regards,
Laxmisha B V
 

Attachments

  • L3_Practice6_Data_Manipulation_NA_SA.zip
    43.8 KB · Views: 2
  • L3_Practice8_Lesson_End.zip
    44 KB · Views: 2
  • L3_Practice7_Data_Manipulation_Salaries.zip
    66.4 KB · Views: 1
  • L4_Practice2_LinRegression_Boston.zip
    47.6 KB · Views: 3
  • L4_Practice4_Ridge_Lasso_Boston.zip
    49.1 KB · Views: 2
  • L4_Practice6_LogRegression_Iris.zip
    45.8 KB · Views: 2
  • L7_Practice5_K-Means_Clustering_Dog_Pic.zip
    312.4 KB · Views: 1
Uploading a revised version after attempting to solve it during the break.
 

Attachments

  • sk_project.ipynb.zip
    1.5 KB · Views: 1
  • sudha_project_revised.ipynb.zip
    2.5 KB · Views: 1

_79119

Member
Attached is the Lesson 2 project.
 

Attachments

  • LessonEndProject_day2.pdf
    17.9 KB · Views: 0
  • LessonEndProject_day2.pdf
    17.9 KB · Views: 0

_85417

New Member
Hi,
I am uploading the *.ipynb file, as the conversion to PDF is not working.

Thanks
Ravichandran
 

Attachments

  • LessonEndProject.zip
    2.1 KB · Views: 0
Hi,

Please find the Assignment 1 code below:

import pandas as pd
# Build a small DataFrame; "." marks missing entries
rd = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
      'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
      'age': [42, 52, 36, 24, 73],
      'preTestScore': [4, 24, 31, ".", "."],
      'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df1 = pd.DataFrame(rd)
print(df1)
# Write to CSV without the index, then read it back;
# the default header handling keeps the first row as column names
df1.to_csv("project.csv", index=False)
df1project = pd.read_csv("project.csv")
df1project.head()


first_name last_name age preTestScore postTestScore
0 Jason Miller 42 4 25,000
1 Molly Jacobson 52 24 94,000
2 Tina . 36 31 57
3 Jake Milner 24 . 62
4 Amy Cooze 73 . 70

In [21]:
df1project = df1project.rename(columns={'first_name': 'First Name', 'last_name': 'Last Name'})
df1project.keys()

Out[21]:
Index(['First Name', 'Last Name', 'age', 'preTestScore', 'postTestScore'], dtype='object')

In [13]:
df1project.isna()

Out[13]:
First Name Last Name age preTestScore postTestScore
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False

In [28]:
df1project.iloc[4:, :]

Out[28]:
First Name Last Name age preTestScore postTestScore
4 Amy Cooze 73 . 70


by
S.Nagarajan
 

_90819

Member
Please find attached the Class Project
Regards,
Gurjit Bakshi
 

Attachments

  • Practice ML - Jupyter Notebook.pdf
    237.4 KB · Views: 5

_79223

Member
Hi Wajahat
Please find the PDFs attached.

Regards
Ashwini Thanthri
 

Attachments

  • Lesson2_Concate_North & South America.pdf
    111.7 KB · Views: 3
  • Lesson2_End project.pdf
    202 KB · Views: 3
  • Lesson2_Salary data_employees.pdf
    155.5 KB · Views: 6

_90819

Member
Please find attached the Homework Assignment.
Regards,
Gurjit Bakshi
 

Attachments

  • Salaries Data - Jupyter Notebook.pdf
    120.7 KB · Views: 6
  • South America and North America data - Jupyter Notebook.pdf
    172.3 KB · Views: 5

_79801

Member
Hi,

PFA assignment.

Thanks & Regards,
Karthick R
 

Attachments

  • Concatenation_Salary_Analysis_Assignment.zip
    95.1 KB · Views: 3

_89624

Active Member
Hello Wajahat,
Good day!
Please find attached the assignment on the concatenation demo.
Regards
Aiysha
 

Attachments

  • noramerica_southamerica_concatenation_demo.zip
    2.5 KB · Views: 1

_89624

Active Member
Re-attaching in PDF format.

Hello Wajahat,

Please find attached the assignment on the concatenation demo.

Regards
Aiysha
 

Attachments

  • noramerica_southamerica_concatenation_demo.slides.pdf
    40.7 KB · Views: 2
Hi,

Attaching files for the class assignment - 10th Jan 2021 (weekdays batch).
I was not able to do the "maximum salary for the employee across years" problem. I will try it again.

Thanks,
Abhishek
 

Attachments

  • Assignments.zip
    113.9 KB · Views: 1

Abhinav Ranjan

Member
Alumni
Hi,

Attached please find the self-assist assignment from 10-01-2021 class.


Thanks & Regards,
Abhinav Ranjan
 

Attachments

  • salary_&_country_data_assignment_10-01-2021.pdf
    650.6 KB · Views: 8

_81437

New Member
Hello Wajahat,

Please find attached the assignments on ClassWork, Manipulation, DataExploration, and Concatenation.

With Regards,
Kumar Rohit
 

Attachments

  • TotalAssignment_4.zip
    22.6 KB · Views: 0
  • Concat Feature.zip
    1.6 KB · Views: 0
Machine Learning


10th January, 2021


Lesson 1: Introduction to Machine Learning (Theory)

Lesson 2: Data Wrangling and Manipulation

Loading .csv File in Python
In [70]:
import pandas # import the pandas library
import numpy as np #import the numpy library
df = pandas.read_csv("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.csv")


Loading Data to .csv File
In [9]:
df.to_csv("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.csv")


Loading .xlsx File in Python
In [14]:
df = pandas.read_excel("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.xlsx")


Loading Data to .xlsx File
In [15]:
df.to_excel("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.xlsx")

In [16]:
df.shape #Dimensionality Check

Out[16]:
(32, 13)
In [19]:
type(df) #Type of Dataset

Out[19]:
pandas.core.frame.DataFrame
In [21]:
df['cyl'].dtype #Type of column in a dataframe

Out[21]:
dtype('int64')
In [22]:
#slicing a list
lst = [1, 2, 3, 4, 5]  # avoid shadowing the built-in name "list"
lst[1:3]

Out[22]:
[2, 3]
In [23]:
#using iloc indexer
df.iloc[:,1:3]

Out[23]:
model mpg
0 Mazda RX4 21.0
1 Mazda RX4 Wag 21.0
2 Datsun 710 22.8
3 Hornet 4 Drive 21.4
4 Hornet Sportabout 18.7
5 Valiant 18.1
6 Duster 360 14.3
7 Merc 240D 24.4
8 Merc 230 22.8
9 Merc 280 19.2
10 Merc 280C 17.8
11 Merc 450SE 16.4
12 Merc 450SL 17.3
13 Merc 450SLC 15.2
14 Cadillac Fleetwood 10.4
15 Lincoln Continental 10.4
16 Chrysler Imperial 14.7
17 Fiat 128 32.4
18 Honda Civic 30.4
19 Toyota Corolla 33.9
20 Toyota Corona 21.5
21 Dodge Challenger 15.5
22 AMC Javelin 15.2
23 Camaro Z28 13.3
24 Pontiac Firebird 19.2
25 Fiat X1-9 27.3
26 Porsche 914-2 26.0
27 Lotus Europa 30.4
28 Ford Pantera L 15.8
29 Ferrari Dino 19.7
30 Maserati Bora 15.0
31 Volvo 142E 21.4
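Since `iloc` is purely positional, it is worth contrasting it with label-based `loc`. A minimal sketch using a tiny two-row stand-in frame (not the full mtcars data):

```python
import pandas as pd

# Tiny stand-in frame; column order matters for iloc
df = pd.DataFrame({'model': ['Mazda RX4', 'Datsun 710'],
                   'mpg': [21.0, 22.8]})

# iloc selects by integer position, loc by index/column label
print(df.iloc[0, 1])      # row 0, second column
print(df.loc[0, 'mpg'])   # same cell, addressed by label
```

Both lines print 21.0 here, since the default integer index makes position and label coincide.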

In [25]:
#Get Unique values of the column
df['model'].unique()

Out[25]:
array(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D',
'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL',
'Merc 450SLC', 'Cadillac Fleetwood', 'Lincoln Continental',
'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla',
'Toyota Corona', 'Dodge Challenger', 'AMC Javelin', 'Camaro Z28',
'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa',
'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
dtype=object)
In [28]:
#Return all Values of the column
df['model'].values

Out[28]:
array(['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710', 'Hornet 4 Drive',
'Hornet Sportabout', 'Valiant', 'Duster 360', 'Merc 240D',
'Merc 230', 'Merc 280', 'Merc 280C', 'Merc 450SE', 'Merc 450SL',
'Merc 450SLC', 'Cadillac Fleetwood', 'Lincoln Continental',
'Chrysler Imperial', 'Fiat 128', 'Honda Civic', 'Toyota Corolla',
'Toyota Corona', 'Dodge Challenger', 'AMC Javelin', 'Camaro Z28',
'Pontiac Firebird', 'Fiat X1-9', 'Porsche 914-2', 'Lotus Europa',
'Ford Pantera L', 'Ferrari Dino', 'Maserati Bora', 'Volvo 142E'],
dtype=object)
In [31]:
#Return mean of the dataframe across all columns
df.mean()

Out[31]:
Unnamed: 0 15.500000
mpg 20.090625
cyl 6.187500
disp 230.721875
hp 146.687500
drat 3.596563
wt 3.217250
qsec 17.848750
vs 0.437500
am 0.406250
gear 3.687500
carb 2.812500
dtype: float64
In [32]:
#Return median of the dataframe across all columns
df.median()

Out[32]:
Unnamed: 0 15.500
mpg 19.200
cyl 6.000
disp 196.300
hp 123.000
drat 3.695
wt 3.325
qsec 17.710
vs 0.000
am 0.000
gear 4.000
carb 2.000
dtype: float64
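Note that `df.mean()` and `df.median()` on a frame containing a text column (like `model`) rely on non-numeric columns being skipped, and recent pandas releases require this to be explicit. A small sketch, assuming a recent pandas and a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'model': ['Mazda RX4', 'Datsun 710'],
                   'mpg': [21.0, 23.0]})

# numeric_only=True skips the object-dtype "model" column instead of
# raising a TypeError on recent pandas versions
means = df.mean(numeric_only=True)
print(means.to_dict())  # {'mpg': 22.0}
```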
In [33]:
#Using mode( ) on the data frame will return mode values of the data frame
#across all the columns, rows with axis=0 and axis = 1, respectively.
df.mode(axis=0)

Out[33]:
Unnamed: 0 model mpg cyl disp hp drat wt qsec vs am gear carb
0 0 AMC Javelin 10.4 8.0 275.8 110.0 3.07 3.44 17.02 0.0 0.0 3.0 2.0
1 1 Cadillac Fleetwood 15.2 NaN NaN 175.0 3.92 NaN 18.90 NaN NaN NaN 4.0
2 2 Camaro Z28 19.2 NaN NaN 180.0 NaN NaN NaN NaN NaN NaN NaN
3 3 Chrysler Imperial 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4 Datsun 710 21.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 5 Dodge Challenger 22.8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 6 Duster 360 30.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 7 Ferrari Dino NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 8 Fiat 128 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 9 Fiat X1-9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 10 Ford Pantera L NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 11 Honda Civic NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 12 Hornet 4 Drive NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 13 Hornet Sportabout NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 14 Lincoln Continental NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 15 Lotus Europa NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 16 Maserati Bora NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 17 Mazda RX4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 18 Mazda RX4 Wag NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 19 Merc 230 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 20 Merc 240D NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 21 Merc 280 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 22 Merc 280C NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 23 Merc 450SE NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 24 Merc 450SL NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 25 Merc 450SLC NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 26 Pontiac Firebird NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 27 Porsche 914-2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 28 Toyota Corolla NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 29 Toyota Corona NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
30 30 Valiant NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31 31 Volvo 142E NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN


Plotting a Heatmap with Seaborn
In [34]:
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df.corr()
sns.heatmap(data = correlations,square = True, cmap = "bwr")
plt.yticks(rotation=0)
plt.xticks(rotation=90)

Out[34]:
(array([ 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5,
11.5]),
[Text(0.5, 0, 'Unnamed: 0'),
Text(1.5, 0, 'mpg'),
Text(2.5, 0, 'cyl'),
Text(3.5, 0, 'disp'),
Text(4.5, 0, 'hp'),
Text(5.5, 0, 'drat'),
Text(6.5, 0, 'wt'),
Text(7.5, 0, 'qsec'),
Text(8.5, 0, 'vs'),
Text(9.5, 0, 'am'),
Text(10.5, 0, 'gear'),
Text(11.5, 0, 'carb')])

[Attached image: correlation heatmap for the mtcars dataframe]
Problem Statement: Suppose you are a public school administrator. Some schools in your state of Tennessee are performing below average academically. Your superintendent, under pressure from frustrated parents and voters, has approached you with the task of understanding why these schools are underperforming. To improve school performance, you need to learn more about these schools and their students, just as a business needs to understand its own strengths and weaknesses and its customers. The data includes various demographic, school faculty, and income variables.
Objective: Perform exploratory data analysis, which includes determining the type of the data and running a correlation analysis over it. You need to convert the data into useful information:
  • Read the data into a pandas data frame
  • Describe the data to find more details
  • Find the correlation between 'reduced_lunch' and 'school_rating'
In [38]:
#Read the data in pandas data frame
df_schools = pandas.read_csv("C:/Users/ns45237/working_neeraj/ml_datasets/middle_tn_schools.csv")

In [40]:
#Describe the data to find more details
df_schools.shape

Out[40]:
(347, 15)
In [44]:
df_schools.head()

Out[44]:
name school_rating size reduced_lunch state_percentile_16 state_percentile_15 stu_teach_ratio school_type avg_score_15 avg_score_16 full_time_teachers percent_black percent_white percent_asian percent_hispanic
0 Allendale Elementary School 5.0 851.0 10.0 90.2 95.8 15.7 Public 89.4 85.2 54.0 2.9 85.5 1.6 5.6
1 Anderson Elementary 2.0 412.0 71.0 32.8 37.3 12.8 Public 43.0 38.3 32.0 3.9 86.7 1.0 4.9
2 Avoca Elementary 4.0 482.0 43.0 78.4 83.6 16.6 Public 75.7 73.0 29.0 1.0 91.5 1.2 4.4
3 Bailey Middle 0.0 394.0 91.0 1.6 1.0 13.1 Public Magnet 2.1 4.4 30.0 80.7 11.7 2.3 4.3
4 Barfield Elementary 4.0 948.0 26.0 85.3 89.2 14.8 Public 81.3 79.6 64.0 11.8 71.2 7.1 6.0

In [46]:
#Find the correlation between ‘reduced_lunch’ and ‘school_rating'
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df_schools.corr()
sns.heatmap(data = correlations,square = True, cmap = "bwr")
plt.yticks(rotation=0)
plt.xticks(rotation=90)

Out[46]:
(array([ 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5,
11.5, 12.5]),
[Text(0.5, 0, 'school_rating'),
Text(1.5, 0, 'size'),
Text(2.5, 0, 'reduced_lunch'),
Text(3.5, 0, 'state_percentile_16'),
Text(4.5, 0, 'state_percentile_15'),
Text(5.5, 0, 'stu_teach_ratio'),
Text(6.5, 0, 'avg_score_15'),
Text(7.5, 0, 'avg_score_16'),
Text(8.5, 0, 'full_time_teachers'),
Text(9.5, 0, 'percent_black'),
Text(10.5, 0, 'percent_white'),
Text(11.5, 0, 'percent_asian'),
Text(12.5, 0, 'percent_hispanic')])

[Attached image: correlation heatmap for the schools dataframe]
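The heatmap shows the full correlation matrix; to read off the single pairwise value the problem statement asks for, `Series.corr` works directly. A sketch with a tiny made-up frame standing in for the schools data:

```python
import pandas as pd

# Tiny made-up stand-in for the schools data (not the real dataset)
df_demo = pd.DataFrame({'reduced_lunch': [10, 30, 50, 70, 90],
                        'school_rating': [5, 4, 3, 2, 1]})

# Series.corr returns the single pairwise Pearson value, rather than
# the full matrix that DataFrame.corr() returns
r = df_demo['reduced_lunch'].corr(df_demo['school_rating'])
print(round(r, 2))  # perfectly negatively related here, so -1.0
```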
The heatmap above shows that there is no correlation between School Rating and Reduced Lunch.

Problem Statement: Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars in a dataset. In response to customer feedback they are coming up with a new model, and as a result they have to explore the current dataset to derive further insights out of it.
Objective: Import the dataset; explore it for dimensionality, type, and the average horsepower across all the cars; and identify a few of the most correlated features, which would help with the modification.
In [48]:
#Read the mtcars dataset into a Pandas dataframe
df_mtcars = pandas.read_csv("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.csv")

In [49]:
#Explore the dimensionality
df_mtcars.shape

Out[49]:
(32, 13)
In [51]:
df_mtcars.head()

Out[51]:
Unnamed: 0 model mpg cyl disp hp drat wt qsec vs am gear carb
0 0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

In [56]:
#Find correlation amongst variables
import matplotlib.pyplot as plt
import seaborn as sns
correlations = df_mtcars.corr()
sns.heatmap(data = correlations,square = True, cmap = 'viridis')
plt.yticks(rotation=0)
plt.xticks(rotation=90)

Out[56]:
(array([ 0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5,
11.5]),
[Text(0.5, 0, 'Unnamed: 0'),
Text(1.5, 0, 'mpg'),
Text(2.5, 0, 'cyl'),
Text(3.5, 0, 'disp'),
Text(4.5, 0, 'hp'),
Text(5.5, 0, 'drat'),
Text(6.5, 0, 'wt'),
Text(7.5, 0, 'qsec'),
Text(8.5, 0, 'vs'),
Text(9.5, 0, 'am'),
Text(10.5, 0, 'gear'),
Text(11.5, 0, 'carb')])

[Attached image: correlation heatmap for the mtcars dataframe]
In [54]:
df_mtcars['hp'].mean()

Out[54]:
146.6875

Data Wrangling

Missing Value Detection

In [59]:
#Shows NA values for all columns
df.isna()

Out[59]:
Unnamed: 0 model mpg cyl disp hp drat wt qsec vs am gear carb
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False False False False False False False False False False False
3 False False False False False False False False False False False False False
4 False False False False False False False False False False False False False
5 False False False False False False False False False False False False False
6 False False False False False False False False False False False False False
7 False False False False False False False False False False False False False
8 False False False False False False False False False False False False False
9 False False False False False False False False False False False False False
10 False False False False False False False False False False False False False
11 False False False False False False False False False False False False False
12 False False False False False False False False False False False False False
13 False False False False False False False False False False False False False
14 False False False False False False False False False False False False False
15 False False False False False False False False False False False False False
16 False False False False False False False False False False False False False
17 False False False False False False False False False False False False False
18 False False False False False False False False False False False False False
19 False False False False False False False False False False False False False
20 False False False False False False False False False False False False False
21 False False False False False False False False False False False False False
22 False False False False False False False False False False False False False
23 False False False False False False False False False False False False False
24 False False False False False False False False False False False False False
25 False False False False False False False False False False False False False
26 False False False False False False False False False False False False False
27 False False False False False False False False False False False False False
28 False False False False False False False False False False False False False
29 False False False False False False False False False False False False False
30 False False False False False False False False False False False False False
31 False False False False False False False False False False False False False

In [61]:
#Shows NA values columnwise
df.isna().any()

Out[61]:
Unnamed: 0 False
model False
mpg False
cyl False
disp False
hp False
drat False
wt False
qsec False
vs False
am False
gear False
carb False
dtype: bool
In [62]:
#Shows NA values at Dataframe level
df.isna().any().any()

Out[62]:
False
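Beyond `.any()`, a per-column count of missing values is often more informative. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': [4.0, 5.0, 6.0]})

# isna() gives a Boolean mask; summing it counts the NaNs per column
na_counts = df.isna().sum()
print(na_counts.to_dict())  # {'a': 1, 'b': 0}
```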

Missing Value Treatment
#Used to impute mean values
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputer = mean_imputer.fit(df_mtcars[['wt']])  # fit/transform expect 2-D input
imputed_df = mean_imputer.transform(df_mtcars[['wt']])
df1 = pd.DataFrame(data=imputed_df, columns=['wt'])
df1

#Used to impute median values. The old sklearn.preprocessing.Imputer (with its
#axis argument) has been removed from scikit-learn; SimpleImputer replaces it.
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
median_imputer = median_imputer.fit(df1)
imputed_df = median_imputer.transform(df1.values)
df1 = pd.DataFrame(data=imputed_df, columns=df1.columns)
df1
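For a single column, pandas alone can do the same mean fill without scikit-learn. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

wt = pd.Series([2.5, np.nan, 3.5])

# fillna with the column mean is the pandas equivalent of
# SimpleImputer(strategy='mean') for one column
filled = wt.fillna(wt.mean())
print(filled.tolist())  # [2.5, 3.0, 3.5]
```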

Dealing with an Outlier
import seaborn as sns
sns.boxplot(x=df1['Assignment'])

#Keep only the rows above the cut-off to drop the outlier
filter = df1['Assignment'].values > 60
df1_outlier_rem = df1[filter]
df1_outlier_rem

Problem Statement: Load the load_diabetes dataset internally from sklearn and check for any missing values or outliers in its 'data' field. If any irregularities are found, treat them accordingly.
Objective: Perform missing value and outlier treatment.
In [78]:
#Read the diabetes dataset
from sklearn.datasets import load_diabetes

In [91]:
data = load_diabetes()
df_diabetes = pandas.DataFrame(data.data, columns=data.feature_names)
df_diabetes.head()

Out[91]:
age sex bmi bp s1 s2 s3 s4 s5 s6
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031991 -0.046641

In [93]:
df_diabetes.isna().any().any()

Out[93]:
False
There are no NA values.

Problem Statement: Mtcars, the automobile company in the United States, has planned to rework and optimize the horsepower of their cars, as most of the customer feedback was centred around horsepower. However, while developing an ML model with respect to horsepower, the efficiency of the model was compromised; irregularities in the data might be one of the causes.
Objective: Check for missing values and outliers within the horsepower column and remove them.
In [94]:
#Read the mtcars dataset from Excel file
df = pandas.read_excel("C:/Users/ns45237/working_neeraj/ml_datasets/mtcars.xlsx")

In [95]:
df.dtypes

Out[95]:
Unnamed: 0 int64
Unnamed: 0.1 int64
model object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object
In [98]:
#Explore the values in the hp column
df['hp'].isna()

Out[98]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
Name: hp, dtype: bool
In [101]:
#Using box plot, let us explore if there are any Outliers in horsepower column
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure()
df.boxplot(['hp'])

Out[101]:
<AxesSubplot:>

[Attached image: box plot of hp showing one high outlier]
In [118]:
#Remove the Outlier and get filtered data
filter=df['hp'] < 275
df_filter = df[filter]

In [119]:
#Again Boxplot using the new filtered dataframe
plt.figure()
df_filter.boxplot(['hp'])

Out[119]:
<AxesSubplot:>

[Attached image: box plot of hp after filtering]

Outlier is now removed
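The 275 hp cut-off above was read off the box plot by eye; the usual 1.5×IQR rule can derive such a threshold automatically. A sketch with made-up horsepower values (not the mtcars column):

```python
import pandas as pd

# Made-up horsepower values with one high outlier (335)
hp = pd.Series([62, 93, 95, 105, 110, 110, 123, 175, 245, 335])

q1, q3 = hp.quantile(0.25), hp.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)   # classic box-plot whisker bound

# Keep only the values at or below the upper whisker
kept = hp[hp <= upper]
print(kept.max())
```

Here `upper` works out to 258.75, so only the 335 value is dropped.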


Using Group By
In [123]:
import pandas as pd
world_cup={'Team':['West Indies','West Indies','India','Australia','Pakistan','Sri Lanka','Australia','Australia','Australia','India','Australia'],
'Rank':[7,7,2,1,6,4,1,1,1,2,1],
'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df=pd.DataFrame(world_cup)
print(df.groupby(['Team','Rank']).groups)


{('Australia', 1): [3, 6, 7, 8, 10], ('India', 2): [2, 9], ('Pakistan', 6): [4], ('Sri Lanka', 4): [5], ('West Indies', 7): [0, 1]}
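`.groups` only lists the row indices in each group; usually the goal is an aggregation per group (the "maximum salary per employee across years" style of question). A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Australia', 'India', 'Australia', 'India'],
                   'Year': [1987, 1983, 2015, 2011],
                   'Rank': [1, 2, 3, 5]})

# Aggregate per group: each team's best (lowest) rank across years
best = df.groupby('Team')['Rank'].min()
print(best.to_dict())  # {'Australia': 1, 'India': 2}
```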


Using Concatenation
In [132]:
import pandas
world_champions={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[243,354,656,432,455]}

chokers={'Team':['South Africa','New Zealand','Zimbabwe'],'ICC_rank':[1,5,9], 'Points':[895, 764, 656]}
df1=pandas.DataFrame(world_champions)
df2=pandas.DataFrame(chokers)
print(pandas.concat([df1,df2],axis=1))


Team ICC_rank World_champions_Year Points Team \
0 India 2 2011 243 South Africa
1 Australia 3 2015 354 New Zealand
2 West Indies 7 1979 656 Zimbabwe
3 Pakistan 8 1992 432 NaN
4 Sri Lanka 4 1996 455 NaN

ICC_rank Points
0 1.0 895.0
1 5.0 764.0
2 9.0 656.0
3 NaN NaN
4 NaN NaN
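With axis=1 the frames are pasted side by side and aligned on the row index, which is why the shorter frame contributes NaNs above; axis=0 stacks rows instead. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'Team': ['India', 'Australia'], 'ICC_rank': [2, 3]})
df2 = pd.DataFrame({'Team': ['South Africa'], 'ICC_rank': [1]})

# axis=0 appends rows; ignore_index renumbers them 0..n-1
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked['Team'].tolist())  # ['India', 'Australia', 'South Africa']
```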


Using Merging
In [ ]:
import pandas
# champion_stats values restored from the concatenation example above;
# the match_stats values were left blank in the notes, so illustrative
# numbers are filled in here
champion_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'ICC_rank':[2,3,7,8,4],
'World_champions_Year':[2011,2015,1979,1992,1996],
'Points':[243,354,656,432,455]}
match_stats={'Team':['India','Australia','West Indies','Pakistan','Sri Lanka'],
'World_cup_played':[11,10,11,9,8],
'ODIs_played':[934,958,733,906,799]}
df1=pandas.DataFrame(champion_stats)
df2=pandas.DataFrame(match_stats)
print(df1)
print(df2)
print(pandas.merge(df1,df2,on='Team'))
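`merge` defaults to an inner join on the key column, so rows whose `Team` is missing from either frame are dropped. A minimal sketch:

```python
import pandas as pd

left = pd.DataFrame({'Team': ['India', 'Australia', 'Zimbabwe'],
                     'ICC_rank': [2, 3, 9]})
right = pd.DataFrame({'Team': ['India', 'Australia'],
                      'Points': [243, 354]})

# Inner join keeps only Teams present in both frames;
# how='left' would keep Zimbabwe with NaN Points instead
inner = pd.merge(left, right, on='Team')
print(inner['Team'].tolist())  # ['India', 'Australia']
```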
 