Welcome to the Simplilearn Community


Machine Learning Advanced Certification | April 19 - May 7

Hello sir,

The answer to the question "print all the odd numbers between 1 and 100, skipping every third number" is as follows:
**********
a = list(map(lambda x: x, range(1,101,2)))
print(a)


for ele in sorted(range(2, len(a), 3), reverse=True):
    del a[ele]

print (*a)
*********

I got the final output as

1 3 7 9 13 15 19 21 25 27 31 33 37 39 43 45 49 51 55 57 61 63 67 69 73 75 79 81 85 87 91 93 97 99
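For what it's worth, Python's extended slices support deletion, so the reversed-index loop above can be collapsed into a single statement; a minimal sketch:

```python
# extended-slice delete: drop indices 2, 5, 8, ... (every third element)
a = list(range(1, 101, 2))   # odd numbers 1..99
del a[2::3]
print(*a)
```

Deleting in one slice also avoids the need to iterate in reverse, since the list is only mutated once.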

We can check and discuss this in tomorrow's class.

Thanks,
Mohana.
 
Problem Statement : Print the odd numbers from 1 to 100, skipping every third number

[x for x in range(1,101,2) if x not in list(filter(lambda x:x,range(1,101,2)))[2::3]]

Another way:

[(lambda x:x)(x) for x in range(1,101,2) if x not in list(filter(lambda x:x,range(1,101,2)))[2::3]]
 
problem Statement :
Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. Based on customer feedback, the company is coming up with a new model, so it has to explore the current dataset to derive further insights out of it.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

dataframe=pd.read_csv('mtcars.csv')

dataframe.head(n=4)
dataframe.info()

dataframe.describe(include='O')
dataframe.describe()

np.mean(dataframe['hp'])

df_corr=dataframe.corr()
df_corr

Correlated features:

mpg - cyl --> highly negatively correlated

mpg - disp --> highly negatively correlated

mpg - hp --> highly negatively correlated

mpg - wt --> highly negatively correlated
 
Problem Statement: Load the load_diabetes dataset from sklearn and check for any missing values or outliers in the 'data' column. If any irregularities are found, treat them accordingly.

from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
print(diabetes.DESCR)

df=pd.DataFrame(data=diabetes.data,columns=diabetes.feature_names)
df['Target']=diabetes.target

df.head(n=4)
df.shape
df.info()

#Missing value (No Missing Value)

df.isna().sum().sort_values(ascending=False)

for feature in df.columns:
    sns.boxplot(df[feature])
    plt.show()

#These features have outliers:

#bmi --> >0.12

#s6 --> < -0.10 and >0.12

#s5 --> >0.12

#s4 --> >0.14

#s3 --> >0.12

#s2 --> >0.11

#s1 --> >0.11


df_new=df[~((df['bmi']>0.12)|((df['s6']< -0.10) | (df['s6']>0.12))|(df['s5']>0.12)|(df['s4']>0.14)|(df['s3']>0.12)|(df['s2']>0.11)|(df['s1']>0.11))]

for feature in df_new.columns:
    sns.boxplot(df_new[feature])
    plt.show()
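Instead of reading cutoffs off the box plots by eye, the conventional 1.5×IQR rule derives them from the quartiles; a sketch on synthetic data (the column names and values here are stand-ins, not the real diabetes features):

```python
import numpy as np
import pandas as pd

# synthetic stand-in frame (column names reused for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame({"bmi": rng.normal(0, 0.05, 442),
                   "s6": rng.normal(0, 0.05, 442)})

# 1.5*IQR rule: a row is an outlier if any column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).any(axis=1)
df_clean = df[~outlier_mask]
print(len(df), "->", len(df_clean))
```

The quartile-based bounds are reproducible across reruns, whereas hand-read thresholds have to be revised whenever the data changes.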
 
Notebook printouts (PDF) attached for the 'mtcars' and 'diabetes' datasets.
 

Attachments

  • ML_mtcars_20210420_JBR.pdf
    191.4 KB · Views: 5
  • ML_diab_20210420_JBR.pdf
    215.1 KB · Views: 7

Loganathan Kumarasamy

Customer
Problem Desc:

Mtcars, an automobile company in Chambersburg, United States, has recorded the production of its cars within a dataset. The company is coming up with a new model based on the feedback given by its customers. It has to explore the current dataset to derive further insights from it.

Objective: Import the dataset; explore its dimensionality, types, and the average horsepower across all the cars. Also, identify a few of the most correlated features, which would help in the modification.

import pandas as pd

df_mtcars = pd.read_csv("mtcars.csv")
df_mtcars

df_mtcars.shape
df_mtcars.describe()

df_mtcars.dtypes
df_mtcars.mean()

df_mtcars['hp'].mean()

import seaborn as sns
sns.heatmap(df_mtcars.corr())


Highly correlated features:

cyl and disp
wt and hp
hp and cyl
 
Problem 1 : mtcars

import pandas as pd
import seaborn as sns

df_mtcars = pd.read_csv("mtcars.csv")
df_mtcars.head()

df_mtcars.shape
df_mtcars.dtypes

df_mtcars['hp'].mean()

sns.heatmap(df_mtcars.corr());

Results : a) (32, 12) - shape

b)
model object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object

c) 146.6875 - hp average

d) some of the correlation results
mpg is negatively correlated with cyl, disp, hp, carb, wt
cyl is negatively correlated with vs, mpg, drat
hp is negatively correlated with vs, qsec and mpg
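Rather than reading these relationships off the heatmap by eye, the same ranking can be pulled directly from the correlation matrix; a sketch on a tiny stand-in frame (values invented for illustration, not the real mtcars data):

```python
import pandas as pd

# tiny stand-in for mtcars (values illustrative only)
df = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 14.3, 32.4],
                   "cyl": [6, 4, 8, 8, 4],
                   "hp":  [110, 93, 175, 245, 66]})

# features most correlated with mpg, most negative first
corr_with_mpg = df.corr()["mpg"].drop("mpg").sort_values()
print(corr_with_mpg)
```

Sorting the mpg column of the correlation matrix lists the negative correlations first, which matches the observations above.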
 
Problem 2 : Load Diabetes

import pandas as pd
from sklearn.datasets import load_diabetes
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

diabetes = load_diabetes()

df_diabetes = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_diabetes.head()

df_diabetes.isna().any()

O/P
age False
sex False
bmi False
bp False
s1 False
s2 False
s3 False
s4 False
s5 False
s6 False
dtype: bool

import seaborn as sns

for keys in df_diabetes.columns:
    sns.boxplot(df_diabetes[keys])
    plt.show()

# Outliers are there for bmi, s1, s2, s3, s4, s5 and s6

dict_new = {"s4" : 0.145, "s5" : 0.125, "s3" : 0.12, "s2" : 0.11, "bmi" : 0.125}

for keys in dict_new:
    filter_new = df_diabetes[keys] < dict_new.get(keys)
    df_diabetes_new = df_diabetes[filter_new]
    sns.boxplot(df_diabetes_new[keys])
    plt.show()

#For the columns s6 and s1, we have outliers on both sides, so they are handled separately; still working on bringing this under the for loop.

filter_s6 = (df_diabetes['s6'] < 0.115) & (df_diabetes['s6']>-0.12)
df_diabetes_s6 = df_diabetes[filter_s6]
sns.boxplot(df_diabetes_s6['s6']);
plt.show()

filter_s1 = (df_diabetes['s1'] < 0.11) & (df_diabetes['s1']>-0.12)
df_diabetes_s1 = df_diabetes[filter_s1]
sns.boxplot(df_diabetes_s1['s1']);
plt.show()
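The two one-off filters for s6 and s1 can be folded into the loop by storing (low, high) bounds per column; a sketch on a small stand-in frame (the values are illustrative, not the real diabetes data):

```python
import pandas as pd

# stand-in frame (values illustrative)
df = pd.DataFrame({"s1": [-0.15, -0.05, 0.00, 0.05, 0.15],
                   "s6": [-0.14, -0.02, 0.01, 0.08, 0.16]})

# per-column (low, high) bounds; a one-sided cut just gets a very loose other side
bounds = {"s1": (-0.12, 0.11), "s6": (-0.12, 0.115)}

df_new = df.copy()
for col, (low, high) in bounds.items():
    df_new = df_new[(df_new[col] > low) & (df_new[col] < high)]
print(df_new)
```

Each iteration narrows the frame with both bounds at once, so two-sided columns need no special casing.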
 
MTCARS data problem

# dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# dataset import
mtcars = pd.read_csv("mtcars.csv")
print(mtcars.head())

# dataset info > shape
print("Shape of dataset:", mtcars.shape)
# dataset info missing values > True = Missing values, False -> No missing value
print("Missing values:", mtcars.isna().any().any())

# distinct number of cylinders in car models
print("Number of distinct car cylinders: ", mtcars['cyl'].unique())

# statistical summary of car mileage "mpg" for distinct car cylinders "cyl"
print("Statistical summary of car mileage for distinct numbers of car cylinders.")
print(mtcars.groupby(['cyl'])['mpg'].agg('describe'))

# correlation plot > heatmap
plt.figure(figsize=(8,8))
sns.heatmap(mtcars.corr(),
            square=True, fmt='.1g', annot=True, cmap='RdBu_r')
plt.show()
# (plot saved as mtcars_correlation.png)
# Distribution of car mileage "mpg" for each number of car cylinders "cyl" > kde plot
fig, ax = plt.subplots(1, 1)
for n, cyl in enumerate(np.sort(mtcars['cyl'].unique())):
    sns.kdeplot(mtcars[mtcars['cyl'] == cyl]['mpg'], label=cyl)
ax.axes.get_yaxis().set_visible(False)
plt.legend()
plt.show()
# (plot saved as mpg_cyl.png)
# Relationship between weight "wt" and mileage "mpg" for car models
plt.figure(figsize=(5,5))
sns.scatterplot(x='wt', y='mpg', data=mtcars)
plt.show()
# (plot saved as wt_vs_mpg.png)

# OBSERVATIONS
# - The dataset is tidy and consists of 12 features for 32 car models.
# - There are no missing values.
# - There are three cylinder counts across the car models:
# - 4 cylinders
# - 6 cylinders (least common)
# - 8 cylinders (most common)
# - The heatmap reveals the features most correlated with car mileage (mpg):
# - cyl (# of cylinders): high negative correlation (-0.9)
# - wt (weight of car in 1000 lbs): high negative correlation (-0.9)
# - There is an inverse relationship between car mileage and both the number of cylinders and car weight.
 
MTCARS

# Importing pandas and reading the file
import pandas as pd
df_mtcars = pd.read_csv(r"C:\Users\alanw\Jupyter\3. Machine Learning\first week\mtcars.csv")


# Analyzing the dimensions of the dataframe
df_mtcars.shape

(32, 12)


# Checking the values of the columns to be able to filter them later
df_mtcars.columns.unique()

Index(['model', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')


# Showing a table with the average hp per car and ordering it by descending hp value
df_mtcars.groupby(['model']).mean().filter(['hp']).sort_values('hp', ascending=False)

model                    hp
Maserati Bora           335
Ford Pantera L          264
Camaro Z28              245
Duster 360              245
Chrysler Imperial       230
Lincoln Continental     215
Cadillac Fleetwood      205
Merc 450SLC             180
Merc 450SL              180
Merc 450SE              180
Hornet Sportabout       175
Pontiac Firebird        175
Ferrari Dino            175
AMC Javelin             150
Dodge Challenger        150
Merc 280                123
Merc 280C               123
Lotus Europa            113
Hornet 4 Drive          110
Mazda RX4               110
Mazda RX4 Wag           110
Volvo 142E              109
Valiant                 105
Toyota Corona            97
Merc 230                 95
Datsun 710               93
Porsche 914-2            91
Fiat 128                 66
Fiat X1-9                66
Toyota Corolla           65
Merc 240D                62
Honda Civic              52

# Printing correlation between columns
df_mtcars.corr()


           mpg       cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
mpg   1.000000 -0.852162 -0.847551 -0.776168  0.681172 -0.867659  0.418684  0.664039  0.599832  0.480285 -0.550925
cyl  -0.852162  1.000000  0.902033  0.832447 -0.699938  0.782496 -0.591242 -0.810812 -0.522607 -0.492687  0.526988
disp -0.847551  0.902033  1.000000  0.790949 -0.710214  0.887980 -0.433698 -0.710416 -0.591227 -0.555569  0.394977
hp   -0.776168  0.832447  0.790949  1.000000 -0.448759  0.658748 -0.708223 -0.723097 -0.243204 -0.125704  0.749812
drat  0.681172 -0.699938 -0.710214 -0.448759  1.000000 -0.712441  0.091205  0.440278  0.712711  0.699610 -0.090790
wt   -0.867659  0.782496  0.887980  0.658748 -0.712441  1.000000 -0.174716 -0.554916 -0.692495 -0.583287  0.427606
qsec  0.418684 -0.591242 -0.433698 -0.708223  0.091205 -0.174716  1.000000  0.744535 -0.229861 -0.212682 -0.656249
vs    0.664039 -0.810812 -0.710416 -0.723097  0.440278 -0.554916  0.744535  1.000000  0.168345  0.206023 -0.569607
am    0.599832 -0.522607 -0.591227 -0.243204  0.712711 -0.692495 -0.229861  0.168345  1.000000  0.794059  0.057534
gear  0.480285 -0.492687 -0.555569 -0.125704  0.699610 -0.583287 -0.212682  0.206023  0.794059  1.000000  0.274073
carb -0.550925  0.526988  0.394977  0.749812 -0.090790  0.427606 -0.656249 -0.569607  0.057534  0.274073  1.000000

The three lowest correlations (we could say there is almost no correlation) are:
1) am and carb 0.057534
2) drat and carb -0.090790
3) drat and qsec 0.091205

The three highest correlations are:
1) cyl and disp 0.902033
2) disp and wt 0.887980
3) mpg and wt -0.867659
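The extreme pairs above were read off the table by hand; they can also be ranked programmatically by flattening the correlation matrix. A sketch on a tiny stand-in frame (values invented for illustration, not the real mtcars data):

```python
import pandas as pd

# tiny stand-in for mtcars (values illustrative only)
df = pd.DataFrame({"mpg": [21.0, 22.8, 18.7, 14.3, 32.4],
                   "cyl": [6, 4, 8, 8, 4],
                   "wt":  [2.62, 2.32, 3.44, 3.57, 2.20]})

# flatten the correlation matrix into one Series of unique off-diagonal pairs
pairs = df.corr().unstack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]

# rank by absolute strength, strongest first
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index))
```

Keeping only pairs where the first label sorts before the second removes both the diagonal and the duplicate mirror entries.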
 
Problem 1: mtcars

import pandas as pd

df = pd.read_csv('mtcars.csv')

df.shape


df.head()

Out[4]:
               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
In [5]:
df.dtypes

Out[5]:
model object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object
In [6]:
df['hp'].mean()

Out[6]:
146.6875
In [7]:
import seaborn as sns

In [8]:
sns.heatmap(df.corr())

Out[8]:
<AxesSubplot:>


  1. cyl is negatively correlated with drat, vs and mpg
  2. wt is negatively correlated with drat, mpg and am
  3. hp is negatively correlated with mpg, qsec and vs

Problem 2: Diabetes

import pandas as pd
from sklearn.datasets import load_diabetes
import numpy as np

In [7]:
df = load_diabetes()

In [9]:
print(df.DESCR)


.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

:Number of Instances: 442

:Number of Attributes: First 10 columns are numeric predictive values

:Target: Column 11 is a quantitative measure of disease progression one year after baseline

:Attribute Information:
- age age in years
- sex
- bmi body mass index
- bp average blood pressure
- s1 tc, T-Cells (a type of white blood cells)
- s2 ldl, low-density lipoproteins
- s3 hdl, high-density lipoproteins
- s4 tch, thyroid stimulating hormone
- s5 ltg, lamotrigine
- s6 glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [12]:
df1=pd.DataFrame(data=df.data,columns=df.feature_names)

In [13]:
df1.head()

Out[13]:
        age       sex       bmi        bp        s1        s2        s3        s4        s5        s6
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592  0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493 -0.068330 -0.092204
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356 -0.002592  0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309  0.022692 -0.009362
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592 -0.031991 -0.046641
In [14]:
df1.shape

Out[14]:
(442, 10)
In [15]:
df1.dtypes

Out[15]:
age float64
sex float64
bmi float64
bp float64
s1 float64
s2 float64
s3 float64
s4 float64
s5 float64
s6 float64
dtype: object
In [18]:
import seaborn as sns

In [20]:
import matplotlib.pyplot as plt  # needed for plt.show() below

for keys in df1.columns:
    sns.boxplot(df1[keys])
    plt.show()




 
MTCARS problem -> Check for missing values and outliers within the horsepower column and remove them.


Python:
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
import numpy as np
%matplotlib inline

mtcars = pd.read_csv("mtcars.csv")

print("Files Imported Successfully...!")

description = mtcars.describe()
#print(description)

data_types = mtcars.dtypes
#print(data_types)

missing_values = mtcars.isna().any()
#print(missing_values)

#sns.boxplot(mtcars['hp'])

filt = mtcars["hp"].values<300

mtcars_filt = mtcars[filt]

sns.boxplot(mtcars_filt['hp'])
 

Attachments

  • 1.png
    1.png
    3.1 KB · Views: 0
  • 2.png
    2.png
    2.8 KB · Views: 0
DIABETES data problem

# Working with sklearn's diabetes dataset

# dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load data
from sklearn.datasets import load_diabetes
diabetes = pd.DataFrame(np.c_[load_diabetes().data, load_diabetes().target],
                        columns=load_diabetes().feature_names + ['target'])

# view
print("\n", diabetes.head())

# about diabetes data
print("\n", load_diabetes().DESCR)

# data info > shape
print("\nShape of diabetes features data: ", diabetes.shape)

# data info > missing data (True: Missing data; False: No missing data)
print("\nMissing data: ", diabetes.isna().any().any())

# statistical summary > describe
print("\nStatistical summary")
print(round(diabetes.describe().T, 2))

# statistical summary > correlation
plt.figure(figsize=(7, 7))
sns.heatmap(diabetes.corr(), square=True, fmt='.2g', annot=True, cmap='RdBu_r')
plt.title("Correlation in diabetes dataset")
plt.show()
# (plot saved as diabetes_correlation.png)

# relationship between "target" & correlated features (bmi & s5)
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
sns.scatterplot(x='bmi', y='target', data=diabetes, ax=axes[0])
axes[0].set_title("Relationship between BMI & disease progression")
sns.scatterplot(x='s5', y='target', data=diabetes, ax=axes[1])
axes[1].set_title("Relationship between lamotrigine & disease progression")
plt.tight_layout()
plt.show()
# (plot saved as reationship_with_target.png)

# outlier detection
for n, col in enumerate(diabetes.columns):
    plt.figure(figsize=(5, 1))
    sns.boxplot(data=diabetes, x=col)
    plt.show()
# (box plots saved as bmi_outlier.png, s1_outlier.png, s2_outlier.png, s3_outlier.png, s4_outlier.png, s5_outlier.png, s6_outlier.png)

# OBSERVATIONS
# - Target variable (i.e. disease progression) has **positive moderate correlation** with the following features:
# - bmi (Body Mass Index): positive correlation (0.59)
# - s5 (lamotrigine): positive correlation (0.57)
# - The problem of collinearity could be due to the following correlated features:
# - s1 & s2 (0.9)
# - s3 & s4 (-0.74)
# - s2 & s4 (0.66)
# - s4 & s5 (0.62)
# - The following columns have outliers present:
# - bmi
# - s1
# - s2
# - s3
# - s4
# - s5
# - s6
 
Load_Diabetes -> Perform missing value and outlier data treatment.

Python:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.datasets import load_diabetes
%matplotlib inline
diabetes = load_diabetes()

print("Files Imported Successfully...!")

#print(diabetes.DESCR)
df=pd.DataFrame(data=diabetes.data,columns=diabetes.feature_names)

#df.head()

df.isna().any()


for col in df.columns:
    fil = df[col].values > -0.12   # features are mean-centered and scaled, so -12 was likely meant to be -0.12
    df_new = df[fil]
    sns.boxplot(df_new[col])       # plot the filtered column so the treatment is visible
    plt.show()

 
Problem Statement :
SFO Public Department, referred to as SFO, has captured all the salary data of its employees from 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save the payroll budget. You have to do data manipulation and answer the questions below:

1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top earning employee across all the years?
 

Attachments

  • Untitled.pdf
    46.8 KB · Views: 13
Problem Statement :
SFO Public Department, referred to as SFO, has captured all the salary data of its employees from 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save the payroll budget. You have to do data manipulation and answer the questions below:

1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top earning employee across all the years?
 

Attachments

  • Untitled (1).pdf
    48.2 KB · Views: 11
SFO Public Department Problem

# dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read data
salary = pd.read_csv("./Salaries.csv", low_memory=False)

# view data
print(salary.head())

# increase in TotalPay by Years
salary_pay_by_year = salary.groupby('Year').sum()[['TotalPay']].reset_index()
sns.lineplot(x='Year', y='TotalPay', data=salary_pay_by_year)
plt.title("Total pay change by year")
plt.show()
# (plot saved as total_pay_by_year.png)
# top earning employee across all the years
salary_top_pay = salary.groupby('EmployeeName').sum()['TotalPay'].sort_values(ascending=False).head()
salary_top_pay = pd.DataFrame(salary_top_pay).reset_index()
sns.barplot(x='EmployeeName', y='TotalPay', data=salary_top_pay)
plt.title("Top 5 paid employees")
plt.show()
# (plot saved as top_5_pay_employees.png)
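For question 1, the increase itself falls out of the same groupby; a sketch on a stand-in frame (the Salaries.csv figures here are invented, not the real data):

```python
import pandas as pd

# stand-in for Salaries.csv (figures invented)
salary = pd.DataFrame({"Year": [2011, 2011, 2014, 2014],
                       "TotalPay": [50000.0, 60000.0, 70000.0, 75000.0]})

# total pay per year, then the 2011 -> 2014 difference
total_by_year = salary.groupby("Year")["TotalPay"].sum()
increase = total_by_year.loc[2014] - total_by_year.loc[2011]
print(increase)  # 35000.0 for these made-up numbers
```

Selecting the two years with `.loc` keeps the answer as a single scalar rather than a whole series.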
 
Data Science - Machine Learning course: as discussed, I have attached the three assignments given in the Data Preprocessing lesson.
 

Attachments

  • SFOPublicDapartment_Assignement.pdf
    49.6 KB · Views: 4
  • DataExploration_mtcars_Assignment.pdf
    68.2 KB · Views: 3
  • DataExploring_Diabetes_Assignment.pdf
    137.5 KB · Views: 2
Lesson-end Project :: Lesson 3

# dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# DataFrame
data = pd.DataFrame({
    "first_name": ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    "last_name": ["Miller", "Jacobson", ".", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, ".", "."],
    "postTestScore": ["25,000", "94,000", 57, 62, 70]
})

# Save the data frame into a .csv file as project.csv
data.to_csv("project.csv", index=False)

# Read the project.csv file and print the data frame.
project = pd.read_csv("project.csv")
print(project.head())

# Read the project.csv file without column heading.
pd.read_csv('project.csv', header=None, skiprows=1)

# Read the project.csv file and make two index columns, namely, ‘First Name’ and ‘Last Name’.
pd.read_csv("project.csv", index_col=['first_name', 'last_name'])

# Print the data frame in a Boolean form as True or False.
# True for Null/ NaN values and false for non-null values.
project = pd.read_csv('project.csv', na_values='.')
print(project.head())
print(project.isna())

# Read the data frame by skipping the first 3 rows and print the data frame.
project = pd.read_csv("project.csv", skiprows=[1,2,3])
print(project)
 
Problem Statement :
SFO Public Department, referred to as SFO, has captured all the salary data of its employees from 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save the payroll budget. You have to do data manipulation and answer the questions below:

1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top earning employee across all the years?
 

Attachments

  • Salaries_assign.pdf
    25.1 KB · Views: 9
Problem Statement :
SFO Public Department, referred to as SFO, has captured all the salary data of its employees from 2011 to 2014. Now, in 2018, the organization is facing a financial crisis. As a first step, HR wants to rationalize employee cost to save the payroll budget. You have to do data manipulation and answer the questions below:

1. How much has the total salary cost increased from 2011 to 2014?
2. Who was the top earning employee across all the years? - I used sort_values to find the top earning employee; is it possible to use the .max() function? I tried it, but it could only show the value, not retrieve the employee's name.
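Regarding the .max() question: max() returns only the value, but idxmax() returns the index label of that maximum, and .loc with that label recovers the whole row, name included. A small sketch (names and pay values invented):

```python
import pandas as pd

# invented example data
salary = pd.DataFrame({"EmployeeName": ["A", "B", "C"],
                       "TotalPay": [100.0, 300.0, 200.0]})

# idxmax gives the index label of the maximum TotalPay;
# .loc with that label returns the full row
top = salary.loc[salary["TotalPay"].idxmax()]
print(top["EmployeeName"])  # B
```

This avoids sorting the whole frame just to take the first row.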
 

Attachments

  • Salaries.pdf
    28 KB · Views: 7
Problem Statement: raw data:

data = pd.DataFrame({
    "first_name": ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
    "last_name": ["Miller", "Jacobson", ".", "Milner", "Cooze"],
    "age": [42, 52, 36, 24, 73],
    "preTestScore": [4, 24, 31, ".", "."],
    "postTestScore": ["25,000", "94,000", 57, 62, 70]
})
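Those "." placeholders and comma-formatted scores can be handled at read time: na_values turns "." into NaN and thousands="," parses "25,000" as a number. A sketch on a hypothetical two-row CSV:

```python
import io
import pandas as pd

# hypothetical CSV with the same quirks as the raw data above
csv = io.StringIO('first_name,preTestScore,postTestScore\nJason,4,"25,000"\nTina,.,57\n')

# na_values turns "." into NaN; thousands="," parses "25,000" as 25000
df = pd.read_csv(csv, na_values=".", thousands=",")
print(df)
```

Cleaning during parsing keeps the columns numeric from the start, instead of patching object-dtype columns afterwards.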
 

Attachments

  • Test_Score.pdf
    23.3 KB · Views: 2
DATA MANIPULATION

SFO Public Department

Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
salary = pd.read_csv("Salaries.csv",low_memory=False)
df=pd.DataFrame(salary)

# Change in salary cost from 2011 to 2014 (using mean TotalPay per year)

df.info()
df.head()
#df['Year'].unique()
feature=df[['Year','TotalPay']]
feature

salary_mean = df.groupby('Year').mean()[['TotalPay']]
print(salary_mean)

salary_dif = salary_mean.loc[2014]-salary_mean.loc[2011]
salary_dif

sns.lineplot(data=salary_mean)
# top earner per year: idxmax keeps the name and the pay from the same row
# (taking .max() of each column separately would pair the highest pay with the alphabetically last name)
emp_top = df.loc[df.groupby('Year')['TotalPay'].idxmax(), ['Year', 'EmployeeName', 'TotalPay']]
emp_top
sns.barplot(data=emp_top, x='EmployeeName', y='TotalPay')
 

Attachments

  • DATA_MANIPULATION.pdf
    42.5 KB · Views: 1
LESSON_END_PROJECT
Python:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

print("Libraries are now accessible...")
data = pd.DataFrame({'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                     'last_name':['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
                     'age': [42, 52, 36, 24, 73],
                     'preTestScore': [4, 24, 31, ".", "."],
                     'postTestScore': ["25,000", "94,000", 57, 62, 70]})
data

# 1. save dataframe into csv file

data.to_csv("project.csv")
print("Data Exported Successfully as 'project.csv'")

# 2. Read project.csv, print the dataframe, and drop the saved index column

raw = pd.read_csv("project.csv") #header=None
raw = raw.drop(columns=['Unnamed: 0'], inplace=False)

print(pd.DataFrame(raw))

print("\nproject.csv printed Successfully as 'DataFrame'")

# Rename columns
raw = raw.rename(columns={'first_name': 'First Name', 'last_name': 'Last Name'}, inplace=False)
print("Columns renamed Successfully\n")
print(raw[['First Name', 'Last Name']])

# finding any missing values
print("Data Loaded Successfully\n")
raw.isna()

# Remove first 3 rows [0, 1, 2]
print("Data Loaded Successfully")
raw.iloc[3:]
 

Attachments

  • LESSON_END_PROJECT.pdf
    27.4 KB · Views: 6
Hello sir,

Problem Statement: A real estate company wants to build homes at different locations in Boston. They have data for historical prices but haven’t decided the actual prices yet. They want to price it so that it is affordable to the general public.
Objective:
• Import the Boston data from sklearn and read the description using DESCR
• Analyze the data and predict the approximate prices for the houses

Solution is attached.

Regards,
Mohana.
 

Attachments

  • Boston Reg Assign.pdf
    113.6 KB · Views: 8