
Machine Learning | Feb 24 - Mar 18 | Jayanth

Jayanth_14

Member
Alumni
Trainer
Hello and Welcome to Machine Learning (Python) Course!
(I thought I had already posted this, but it seems it isn't showing up on this thread.) I am equally excited to deliver this course. Please find below the links for downloading the Anaconda installer, which you will need to install on your machine to perform the hands-on exercises.

For Mac Users:
https://repo.continuum.io/archive/Anaconda3-5.0.1-MacOSX-x86_64.pkg

For Windows Users:
https://repo.continuum.io/archive/Anaconda3-5.0.1-Windows-x86_64.exe

For Linux Users:
https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh

In case any of the above links becomes outdated, you can get the latest link from https://www.anaconda.com/download
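Once Anaconda is installed, a quick way to confirm the environment is working is to import the core packages it bundles (a minimal sanity check, nothing course-specific):

```python
# Minimal sanity check that the Anaconda environment is usable:
# the core scientific packages should import without errors.
import sys
import numpy as np
import pandas as pd
import sklearn

print(sys.version_info.major)  # Anaconda3 installers bundle Python 3
print(np.__name__, pd.__name__, sklearn.__name__)
```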

Best, Jay
 

D_K_Seal

Member
Alumni
# from sklearn.preprocessing import OneHotEncoder
# y_onehotencoder = OneHotEncoder ()
# y = y_onehotencoder.fit_transform(y).toarray()
# print(y)

Here I am getting an error saying a 1D array was passed instead of 2D. I also tried .reshape(-1,1), but it did not work. My question is: do we need to use a one-hot encoder for labels, or is a label encoder fine?

Also, when the labels have more than 2 options, I know we will not use LR; we will use DT, SVM, etc. But do we still need to convert the categorical labels to numerical?

Regards,
Devakalpa
 

Ankit Vaish

Member
Alumni
Dear Jayanth ,

Below is the current MSE I got for the soccer case study. I am trying to upload the .py file but am not able to do so; I will work on this problem tomorrow.

Thanks
ankit

upload_2018-3-18_0-57-54.png
 

Gotthard

Member
Alumni
First I used only the first 400 rows to avoid the Python issues I am having with imputing X and y and with encoding the categorical data.
X = df.iloc [:400,9:].values
y = df.iloc [:400,4].values
With those limited rows and columns I am getting 1.58 for RMSE

If I got the imputing, encoding, etc. right, then I am getting RMSE = 3.28 for the complete dataset
 

D_K_Seal

Member
Alumni
# imputing y gives the error: ValueError: Found array with 0 feature(s) (shape=(400, 0)) while a minimum of 1 is required
from sklearn.preprocessing import Imputer
# first create an imputer
missingValueImputer = Imputer(missing_values='NaN', strategy='mean', axis=1)
# select the columns for which the imputer needs to function
missingValueImputer = missingValueImputer.fit(y[:, 1:2])
# update the missing values
y[:, 1:2] = missingValueImputer.transform(y[:, 1:2])
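For what it's worth, this error usually means the array handed to the imputer is 1-D (or an empty column slice); imputers expect shape (n_samples, n_features). A minimal sketch with a made-up y, using SimpleImputer (the replacement for the older `Imputer` in recent scikit-learn):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # newer scikit-learn's replacement for preprocessing.Imputer

# Hypothetical 1-D target with one missing value
y = np.array([70.0, np.nan, 68.0, 72.0])

# Imputers expect a 2-D (n_samples, n_features) array,
# so reshape the 1-D target into a single column first
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
y_filled = imputer.fit_transform(y.reshape(-1, 1)).ravel()

print(y_filled)  # the NaN is replaced by the mean of 70, 68 and 72, i.e. 70.0
```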
 

D_K_Seal

Member
Alumni
Since I was getting an error imputing y values, I deleted all NAs and chose 400 samples and 6 features. The MSE between y_test and y_pred is 37.22 for 6 features; the square root of that is 6.1.
 

Jayanth_14

Member
Alumni
Trainer
# from sklearn.preprocessing import OneHotEncoder
# y_onehotencoder = OneHotEncoder ()
# y = y_onehotencoder.fit_transform(y).toarray()
# print(y)

Here I am getting an error saying a 1D array was passed instead of 2D. I also tried .reshape(-1,1), but it did not work. My question is: do we need to use a one-hot encoder for labels, or is a label encoder fine?

Also, when the labels have more than 2 options, I know we will not use LR; we will use DT, SVM, etc. But do we still need to convert the categorical labels to numerical?

Regards,
Devakalpa

Either a one-hot encoder or a label encoder is fine (but not both).
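A minimal sketch of the two options on made-up class labels (note the reshape(-1, 1): OneHotEncoder wants a 2-D input, which is usually what the "1d array passed" error is about):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labels = np.array(['high', 'low', 'medium', 'low'])  # hypothetical categorical target

# Option 1: LabelEncoder maps each class to an integer
y_int = LabelEncoder().fit_transform(labels)
print(y_int)  # classes are sorted alphabetically: high=0, low=1, medium=2

# Option 2: OneHotEncoder needs a 2-D input, hence reshape(-1, 1)
y_onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(y_onehot.shape)  # one row per sample, one column per class
```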
First I used only the first 400 rows to avoid the Python issues I am having with imputing X and y and with encoding the categorical data.
X = df.iloc [:400,9:].values
y = df.iloc [:400,4].values
With those limited rows and columns I am getting 1.58 for RMSE


If I got the imputing, encoding, etc. right, then I am getting RMSE = 3.28 for the complete dataset

Hello Gotthard, is 3.28 for train or test?
 

Ankit Vaish

Member
Alumni
Dear Jayanth, below is my final model for the Kaggle case study:

Code:
# coding: utf-8

# In[1]:


import pandas as pd
import numpy as np
import sqlite3


# In[2]:


cnx=sqlite3.connect('database.sqlite')
df=pd.read_sql_query("SELECT * FROM Player_Attributes" , cnx)
df.head()


# In[34]:


df.info()


# In[33]:


df.columns


# In[3]:


df.attacking_work_rate.value_counts()


# In[4]:


df.shape  # rows and columns


# In[5]:


df['overall_rating_1']=df['overall_rating']


# In[6]:


df.drop('overall_rating',axis=1,inplace=True)


# In[7]:


df.head()


# In[8]:


df.tail()


# In[9]:


#from sklearn.preprocessing import LabelEncoder
#df_labelencoder = LabelEncoder()
#df[:, 5] = df_labelencoder.fit_transform(df[:, 5])
#print df
df = df.dropna() # drop missing value


# In[10]:


df.drop(['preferred_foot','defensive_work_rate'],axis=1,inplace=True)  # drop multiple columns at one time


# In[11]:


df.head()


# In[12]:


df.drop(['id','player_fifa_api_id','player_api_id','date'],axis=1,inplace=True)


# In[13]:


#df.drop('player_fifa_api_id',axis=1,inplace=True)


# In[14]:


#df.drop('player_api_id',axis=1,inplace=True)


# In[15]:


#df.drop('date',axis=1,inplace=True)


# In[16]:


df.head()
df.describe()


# In[17]:


df.columns


# In[18]:


df.tail()


# In[19]:


x = df.iloc [:,:-1].values
y = df.iloc [:,-1].values


# In[20]:


from sklearn.preprocessing import LabelEncoder    # encoding attacking_work_rate (categorical variable to numerical)


x_labelencoder = LabelEncoder()
x[:,1] = x_labelencoder.fit_transform(x[:,1])


# In[21]:


from sklearn.preprocessing import OneHotEncoder      # encoding attacking_work_rate (categorical to numerical; used when the data is not ordinal)
x_onehotencoder = OneHotEncoder (categorical_features = [0])
x = x_onehotencoder.fit_transform(x).toarray()


# In[22]:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (x, y, test_size = 0.3,
                                                     random_state = 0)


# In[23]:


X_train.shape


# In[24]:


from sklearn.linear_model import LinearRegression  # import regression class
from sklearn.metrics import r2_score,mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit (X_train,y_train)


# In[25]:


y_pred = lin_reg.predict(X_test)


# In[26]:


r2_score(y_test,y_pred)


# In[27]:


mean_squared_error(y_test,y_pred)


# In[29]:


y_pred1 = lin_reg.predict(X_train)


# In[30]:


r2_score(y_train,y_pred1)


# In[32]:


mean_squared_error(y_train,y_pred1)
 

Attachments

  • upload_2018-3-18_19-9-22.png
    143.7 KB · Views: 16
  • soccel problem case study_Ankit.pdf
    518.5 KB · Views: 9

Dwight Kulkarni

Member
Alumni
Hello Jayanth,

Please find my assignment for the soccer project.

I was encountering some NaNs. These are replaced with the global mean. I used all the variables except the categorical variables.

RMSE and CODE output below python code.


----------------START PYTHON CODE-----------------------------

import sqlite3
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
import numpy as np


cnx = sqlite3.connect('C:\\Users\\Dwight Kulkarni\\Downloads\\soccer\\database.sqlite')
df = pd.read_sql_query("Select * from Player_Attributes",cnx)
df.head()

#print(type(df))
list(df)
print(df.info())


df.replace([np.inf, -np.inf], np.nan)
df.dropna()

# Separate data into features and label
features = df.iloc[:,[5,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]].values
label = df.iloc[:,[4]].values



#print (features)
from sklearn.model_selection import train_test_split  # cross_validation is deprecated

X_train,X_test,y_train,y_test = train_test_split(features,
label,
test_size=0.3,
random_state=10)

print(type(X_train))

print(np.isnan(X_train).any())

print(np.isnan(y_train).any())

##################### X_train #####################
col_mean = np.nanmean(X_train, axis=0)
print(col_mean)

#Find indices that you need to replace
inds = np.where(np.isnan(X_train))

#Place column means in the indices. Align the arrays using take
X_train[inds] = np.take(col_mean, inds[1])

print(np.isnan(X_train).any())


#############y_train######################
col_mean = np.nanmean(y_train, axis=0)
print(col_mean)

#Find indices that you need to replace
inds = np.where(np.isnan(y_train))

#Place column means in the indices. Align the arrays using take
y_train[inds] = np.take(col_mean, inds[1])

print(np.isnan(y_train).any())

print(np.isinf(y_train).any())
print(np.isinf(X_train).any())

###############y_test#####################
col_mean = np.nanmean(y_test, axis=0)
print(col_mean)

#Find indices that you need to replace
inds = np.where(np.isnan(y_test))

#Place column means in the indices. Align the arrays using take
y_test[inds] = np.take(col_mean, inds[1])
##########################################

col_mean = np.nanmean(X_test, axis=0)
print(col_mean)

#Find indices that you need to replace
inds = np.where(np.isnan(X_test))

#Place column means in the indices. Align the arrays using take
X_test[inds] = np.take(col_mean, inds[1])


##############################################
from sklearn.linear_model import LinearRegression
model =LinearRegression()
model.fit(X_train,y_train)


y_predictions = model.predict(X_test)
y_predictions
model.score(X_test,y_test)

from sklearn.metrics import r2_score
print(r2_score(y_test,y_predictions))

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())



print(rmse(y_test,y_predictions))

############################## END PYTHON CODE ##############################






##############################START OUTPUT#########################################
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183978 entries, 0 to 183977
Data columns (total 42 columns):
id 183978 non-null int64
player_fifa_api_id 183978 non-null int64
player_api_id 183978 non-null int64
date 183978 non-null object
overall_rating 183142 non-null float64
potential 183142 non-null float64
preferred_foot 183142 non-null object
attacking_work_rate 180748 non-null object
defensive_work_rate 183142 non-null object
crossing 183142 non-null float64
finishing 183142 non-null float64
heading_accuracy 183142 non-null float64
short_passing 183142 non-null float64
volleys 181265 non-null float64
dribbling 183142 non-null float64
curve 181265 non-null float64
free_kick_accuracy 183142 non-null float64
long_passing 183142 non-null float64
ball_control 183142 non-null float64
acceleration 183142 non-null float64
sprint_speed 183142 non-null float64
agility 181265 non-null float64
reactions 183142 non-null float64
balance 181265 non-null float64
shot_power 183142 non-null float64
jumping 181265 non-null float64
stamina 183142 non-null float64
strength 183142 non-null float64
long_shots 183142 non-null float64
aggression 183142 non-null float64
interceptions 183142 non-null float64
positioning 183142 non-null float64
vision 181265 non-null float64
penalties 183142 non-null float64
marking 183142 non-null float64
standing_tackle 183142 non-null float64
sliding_tackle 181265 non-null float64
gk_diving 183142 non-null float64
gk_handling 183142 non-null float64
gk_kicking 183142 non-null float64
gk_positioning 183142 non-null float64
gk_reflexes 183142 non-null float64
dtypes: float64(35), int64(3), object(4)
memory usage: 59.0+ MB
None
<class 'numpy.ndarray'>
True
True
[ 73.47621016 55.08387283 49.93687503 57.26646959 62.43459744
49.48314474 59.17995258 52.97871619 49.37481085 57.06843673
63.39478652 67.65861972 68.05107405 65.98942508 66.11997878
65.1921216 61.80150695 66.95880351 67.01580269 67.40760183
53.34060808 60.9268131 51.98286352 55.79043103 57.89107514
55.00919614]
False
[ 68.59980032]
False
False
False
[ 68.60051697]
[ 73.42334717 55.09390928 49.88421072 57.26498107 62.41817752
49.43409797 59.16395442 52.9352293 49.39527814 57.07324887
63.37509101 67.66107835 68.05164191 65.92768713 66.06573103
65.18336675 61.82457769 66.99295451 67.0916157 67.46403087
53.33668269 60.9975972 52.07090068 55.77734091 57.83263737
54.99182685]
0.787757159099 [r2_score]
3.24049947064 [RMSE]



############################### END OUTPUT FILE ###############################
 

Vinayak Gupta

New Member
Alumni
Hi Jayanth

I have uploaded the code below.
Below are the steps followed so far:
1. Took only numerical values as features (excluded preferred_foot, attacking_work_rate and defensive_work_rate); tried label encoding but got a syntax error
2. Dropped any rows with NaN (missing) values; tried imputation but it was not working
3. Dataset size reduced by about 1.7% (from roughly 183 K rows to 180 K records)
4. Applied linear regression
5. Results:
R2: 0.812624670359
RMSE (Test): 2.7801079290016966



 

Attachments

  • Soccer Project.zip
    3.1 KB · Views: 5

Arjun Hegde

Member
Alumni
# coding: utf-8

# In[120]:


import sqlite3
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


# In[24]:


def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())


# In[25]:


# Read Data
cnx = sqlite3.connect('database.sqlite')
df = pd.read_sql_query("SELECT * FROM player_Attributes",cnx)


# In[26]:


df = df.dropna()


# In[54]:


# Summary of data
df.head(20)


# In[72]:


# removing the duplicates
df = df.drop_duplicates()
df.shape



# In[29]:


df.shape


# In[47]:


X = df.iloc [:,5:42].values
y = df.iloc [:,4].values
X.shape


# In[55]:


from sklearn.preprocessing import LabelEncoder
X_labelencoder = LabelEncoder()
X[:, 1] = X_labelencoder.fit_transform(X[:, 1])
X[:, 2] = X_labelencoder.fit_transform(X[:, 2])
X[:, 3] = X_labelencoder.fit_transform(X[:, 3])


from sklearn.preprocessing import OneHotEncoder
X_labelencoder = OneHotEncoder(categorical_features = [3])
X = X_labelencoder.fit_transform(X).toarray()
print (X[0])


# In[56]:



from sklearn.model_selection import train_test_split  # cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.2,
random_state = 0)


# In[57]:


from sklearn.linear_model import LinearRegression
regressoragent = LinearRegression()
regressoragent.fit (X_train, y_train )


# In[58]:


predictValues = regressoragent.predict(X_test)


# In[124]:


print("Root Mean squared error: %.2f"
% rmse(y_test, predictValues))
regressoragent.score(X_train,y_train)
regressoragent.score(X_test,y_test)


# In[60]:


# Decision Tree

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error


# In[63]:


tree_reg = DecisionTreeRegressor(random_state=0, max_depth=50)
tree_reg.fit(X_train, y_train)


# In[122]:


r = tree_reg.predict(X_test)
rmse(y_test, r)
tree_reg.score(X_train,y_train)
tree_reg.score(X_test,y_test)



# In[ ]:


y_test.shape


# In[ ]:


#Visualize the data -EDA

from graphviz import Source
from sklearn import tree
from sklearn.tree import export_graphviz


# In[67]:


import subprocess  # needed for the dot call below

def visualize_tree(tree, feature_names):
    """Create tree png using graphviz.

    Args
    ----
    tree -- scikit-learn DecisionTree.
    feature_names -- list of feature names.
    """
    with open("dt.dot", 'w') as f:
        export_graphviz(tree, out_file=f,
                        feature_names=feature_names)

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    try:
        subprocess.check_call(command)
    except Exception:
        exit("Could not run dot, ie graphviz, to "
             "produce visualization")


# In[68]:


feat = list(df.columns.values)
feat.remove('overall_rating')
feat


# In[69]:


visualize_tree(tree_reg, feat)  # tree_reg is the fitted DecisionTreeRegressor above


# In[70]:


#11. Random Forest

from sklearn.ensemble import RandomForestRegressor
RFregressor = RandomForestRegressor(n_estimators = 20, oob_score = True, n_jobs = -1,random_state =50, max_features = "auto")
RFregressor.fit(X_train, y_train)
predictionRF = RFregressor.predict (X_test)


# In[123]:


rmse(y_test, predictionRF)
RFregressor.score(X_train,y_train)
RFregressor.score(X_train,y_train)
RFregressor.score(X_test,y_test)


# In[74]:


sns.set()


# In[111]:


#Histogram - EDA

fig, axes = plt.subplots(len(df.columns)//3, 3, figsize=(12, 48))
print(len(axes))
i = 0
for triaxis in axes:
    for axis in triaxis:
        df.hist(column=df.columns[i], bins=100, ax=axis)
        i = i + 1
plt.show()
 

Natesh

New Member
Alumni
Hi Jayanth,
Please find my assignment for the Soccer project.

Thanks,
Natesh
 

Attachments

  • natesh_pv_regression.zip
    27.3 KB · Views: 7

_18459

Member
Alumni
Hello

Please find an attached Soccer Project Code

Thanks

Sourabh Dhingra
 

Attachments

  • sourabhdhingra.zip
    4.7 KB · Views: 4

Anand18k

Member
Alumni
Hi Jayanth

I have uploaded my project1 below.
Below are the steps followed
1) Select relevant fields and dropping columns such as 'id', 'player_fifa_api_id', 'player_api_id', 'date'
2) Applied LabelEncoder for encoding the categorical data such as 'preferred_foot', 'attacking_work_rate' and ' defensive_work_rate'
3) Applied PCA for dimensionality reduction
4) Applied Linear Regression : RMSE(Test) = 2.7857762360428575
5) Applied Decision Trees : RMSE(Test) = 1.3384626343563018
6) Applied RandomForest : RMSE(Test) = 1.0068692338818137
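A compact sketch of the same PCA-then-regression sequence on synthetic data (shapes and numbers here are made up; the key detail is fitting PCA on the training features only):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(200, 10)                            # stand-in for numeric player attributes
y = X @ rng.rand(10) + rng.normal(0, 0.1, 200)   # synthetic rating-like target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit PCA on the training features only, then apply the same projection to the test set
pca = PCA(n_components=5)
X_train_p = pca.fit_transform(X_train)
X_test_p = pca.transform(X_test)

model = LinearRegression().fit(X_train_p, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_p)))
print(round(rmse, 3))
```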
 

Attachments

  • Anand_Singh_Project1.ipynb.zip
    84.5 KB · Views: 4

Dwight Kulkarni

Member
Alumni
I am trying for the KNN but getting an error.

I took the previous train-test-split data set and ran it through the assignment KNN code.

This error seems to occur when the model expects a categorical target, but Jayanth said KNN doesn't necessarily require this.

Any ideas ?

It is complaining about
ValueError: Unknown label type: 'continuous'

at: KNNClassifier.fit (X_train, y_train)



#############################################################

from sklearn.preprocessing import StandardScaler
independent_scalar = StandardScaler()
X_train = independent_scalar.fit_transform (X_train) #fit and transform
X_test = independent_scalar.transform (X_test) # only transform

#==============================================================================
# Fit the KNN to the train data.
#==============================================================================
from sklearn.neighbors import KNeighborsClassifier
KNNClassifier = KNeighborsClassifier (n_neighbors = 5, metric = 'minkowski',p =2)
KNNClassifier.fit (X_train, y_train)

#==============================================================================
# Predict the values
#==============================================================================

prediction = KNNClassifier.predict (X_test)

print("")
print("test data :-")
print(X_test)
print("predicted output :-")
print(prediction)
print("")
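Not the trainer here, but "Unknown label type: 'continuous'" typically means a classifier received a continuous target (overall_rating is a float). For a continuous target, the regressor variant of KNN fits the same data; a minimal sketch on synthetic numbers:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)           # stand-in features
y_train = X_train.sum(axis=1)        # continuous target, like overall_rating
X_test = rng.rand(10, 3)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# KNeighborsRegressor accepts a continuous y;
# KNeighborsClassifier raises "Unknown label type: 'continuous'" for the same data
knn = KNeighborsRegressor(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train_s, y_train)
pred = knn.predict(X_test_s)
print(pred.shape)  # (10,)
```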
 

Purveshk

New Member
Alumni
Hi Jayanth,

For soccer example, my Linear Regression Test RMSE is 0.0001542317119991783 and Train RMSE is 0.00015365685531910266.

Decision Tree Test RMSE- 2.05985266; Train RMSE - 0.0

Code is as follows:
Code:
# coding: utf-8

# In[85]:


import sqlite3
import pandas as pd
import numpy as np

#Import Pre-processing libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

#Import Component analysis
from sklearn.decomposition import PCA

#Import Algorithms
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

#Import CrossValidation to train/test
from sklearn.model_selection import train_test_split  # cross_validation is deprecated

#Import RMSE and Math libraries
from sklearn.metrics import mean_squared_error
from math import sqrt

#Import Graph plotting Libraries
from matplotlib import pyplot as plt


# In[36]:


#create connection
cnx = sqlite3.connect('./database.sqlite')
df = pd.read_sql_query("SELECT * FROM player_Attributes", cnx)


# In[52]:


# View the datatype
df.dtypes


# In[89]:


# View first few records
df.head(5)


# In[57]:


# View the Statistics of Dataframe

df.describe()


# In[54]:


# View last few records
df.tail(5)


# In[58]:


# View the datatypes of Columns in dataframe

df.dtypes


# In[37]:


#Handle Blanks
df = df.dropna()

#Reorder the Dataframe Columns to get the Target Column at the end
df_new = df[['id', 'player_fifa_api_id', 'player_api_id', 'date', 'potential', 'preferred_foot', 'attacking_work_rate', 'defensive_work_rate', 'crossing', 'finishing', 'heading_accuracy', 'short_passing', 'volleys', 'dribbling', 'curve', 'free_kick_accuracy', 'long_passing', 'ball_control', 'acceleration', 'sprint_speed', 'agility', 'reactions', 'balance', 'shot_power', 'jumping', 'stamina', 'strength', 'long_shots', 'aggression', 'interceptions', 'positioning', 'vision', 'penalties', 'marking', 'standing_tackle', 'sliding_tackle', 'gk_diving', 'gk_handling', 'gk_kicking', 'gk_positioning', 'gk_reflexes', 'overall_rating']]


#df_new.dtypes


# In[38]:


#LabelEncode the String values
le = LabelEncoder()

df_new['date'] = le.fit_transform(df_new['date'].astype(str))
df_new['preferred_foot'] = le.fit_transform(df_new['preferred_foot'].astype(str))
df_new['attacking_work_rate'] = le.fit_transform(df_new['attacking_work_rate'].astype(str))
df_new['defensive_work_rate'] = le.fit_transform(df_new['defensive_work_rate'].astype(str))


# In[39]:


#Split the data

X = df_new.iloc [:,:-1].values
y = df_new.iloc [:,-1].values


# In[46]:


#Run the Decomposition to reduce the features

sklearn_pca = PCA(n_components=41)
X = sklearn_pca.fit_transform(X)  # fit on the features only; fitting on df_new would leak the target column


# In[47]:


# Split the data between Train & Test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# In[48]:


# Standard Scale the Train data

independent_scalar = StandardScaler()
X_train = independent_scalar.fit_transform(X_train) # Fit & Transform
X_test = independent_scalar.transform(X_test) #Transform only


# In[49]:


# Feed the Scaled data to Linear Regression

regressoragent = LinearRegression()
regressoragent.fit(X_train, y_train)

# Predict Train & Test values
Test_predictValues = regressoragent.predict(X_test)
Train_predictValues = regressoragent.predict(X_train)


# In[50]:


# Calculate Test RMSE

rms_test = sqrt(mean_squared_error(Test_predictValues, y_test))

print('RMSE for Test = ', rms_test)


# In[51]:


# Calculate Train RMSE

rms_train = sqrt(mean_squared_error(Train_predictValues, y_train))

print('RMSE for Train = ', rms_train)


# In[76]:


# Decision Tree with Best Splitter

tree_reg = DecisionTreeRegressor(splitter='best')
tree_reg.fit(X_train, y_train)

DT_TrainPredict = tree_reg.predict(X_train)
DT_TestPredict = tree_reg.predict(X_test)


# In[77]:


# Calculate Test RMSE for Decision Tree

DT_rmse_test = sqrt(mean_squared_error(DT_TestPredict, y_test))

print('RMSE for Test = ', DT_rmse_test)


# In[78]:


# Calculate Train RMSE for Decision Tree

DT_rmse_train = sqrt(mean_squared_error(DT_TrainPredict, y_train))

print('RMSE for Train = ', DT_rmse_train)


# In[79]:

#Print Feature importances for the DTree.
tree_reg.feature_importances_
 

Gotthard

Member
Alumni
Attached is my Jupyter Notebook file. I don't have time for the two extra credit questions now, but I do plan to work on those later and submit them.
Linear Regression : RMSE(Test, small subset) = 1.58
Linear Regression : RMSE(Test, full set) = 3.28
Decision Tree: RMSE(Test, full set) = 1.30
 

Attachments

  • Gotthard_Szabo_project1.zip
    62.6 KB · Views: 3

pavan juturu

New Member
Alumni
Linear Regression
np.sqrt(((predictValues-y_test)**2).mean())
RMSE :
2.7716170013245223
regr.score(X_train,y_train)
0.84266030987462759
regr.score(X_test,y_test)
0.84266030987462759

Random Forest
np.sqrt(((predictionRF-y_test)**2).mean())
0.90918102812690371
RFregressor.score(X_train,y_train)
0.99673116080375024
RFregressor.score(X_test,y_test)
0.98306935866702205
 

Attachments

  • project march 17 2018.zip
    12.9 KB · Views: 4

Shruti Shetti

New Member
Alumni
Hi Jayanth

Linear Regression

For test
RMSE : 2.7825010597124735
R2 : 0.8368778784791029

For Train
RMSE : 2.762087075377981
R2 : 0.8376033335459987


I have uploaded the code in the zipped folder.
 

Attachments

  • shruti_shetti_project.zip
    6 KB · Views: 3

Arjun Hegde

Member
Alumni
Linear Regression : RMSE(Test, full set) = 2.77
Decision Tree: RMSE(Test, full set) = 1.324
Random Forest: RMSE(Test, full set) = 0.9286
 

Attachments

  • Arjun_Hegde_project1.zip
    31 KB · Views: 8

Sahil_48

New Member
Alumni
Hello Jayanth I have attached the soccer prediction file
 

Attachments

  • Sahil_Bangera_project1.ipynb.zip
    7.4 KB · Views: 3

subbu_devarajan

New Member
Alumni
The root mean square error for the test data is 3.49.
The r2_score is 0.96.
The root mean square error for the training data is 3.50.



import sqlite3
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
import numpy as np
# create db connection
cnx = sqlite3.connect('database.sqlite')
dfs = pd.read_sql_query ("select * from player_attributes", cnx)

#drop na data
dfs = dfs.dropna()

#display header
dfs.head()

dfs.columns

dfs.info()

dfs.drop(columns=['id', 'player_fifa_api_id', 'player_api_id', 'date','overall_rating','preferred_foot','attacking_work_rate','defensive_work_rate'], axis = 1, inplace = True)
dfs.head()

dfs.shape

#features in X, Target in y
X = dfs.iloc [:,:-1].values
y = dfs.iloc [:,-1].values
from sklearn.preprocessing import LabelEncoder
X_labelencoder = LabelEncoder()
X[:, 1] = X_labelencoder.fit_transform(X[:, 1])
X[:, 2] = X_labelencoder.fit_transform(X[:, 2])
X[:, 3] = X_labelencoder.fit_transform(X[:, 3])
from sklearn.preprocessing import OneHotEncoder
X_labelencoder = OneHotEncoder(categorical_features = [3])
X = X_labelencoder.fit_transform(X).toarray()
print (X[0])
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.3, random_state = 0)
X_train.shape

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

lin_reg = LinearRegression ()
lin_reg.fit (X_train, y_train)

y_pred = lin_reg.predict(X_test)

rms = sqrt(mean_squared_error(y_test, y_pred))


print ('the root mean square error for test data is %.2f' % rms)
r2s = r2_score(y_test, y_pred)
print ('the r2_score for test data is %.2f' % r2s)

y_pred1 = lin_reg.predict(X_train)

rms = sqrt(mean_squared_error(y_train, y_pred1))
print ('the root mean square error for training data is %.2f' % rms)
r2s = r2_score(y_train, y_pred1)
print ('the r2_score for training data is %.2f' % r2s)

X_test.shape

X_train.shape
 

Attachments

  • subbu_devarajan_soccer_assignment_updated.pdf
    75.5 KB · Views: 3

Ankit Vaish

Member
Alumni
My new code :--

upload_2018-3-18_23-37-7.png


# In[20]:


from sklearn.preprocessing import LabelEncoder # encoding attacking_work_rate (categorical variable to numerical)


x_labelencoder = LabelEncoder()
x[:,1] = x_labelencoder.fit_transform(x[:,1])


# In[21]:


from sklearn.preprocessing import OneHotEncoder # encoding attacking_work_rate (categorical to numerical; used when the data is not ordinal)
x_onehotencoder = OneHotEncoder (categorical_features = [0])
x = x_onehotencoder.fit_transform(x).toarray()


# In[22]:


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (x, y, test_size = 0.3,
random_state = 0)


# In[23]:


X_train.shape


# In[37]:


from sklearn.linear_model import LinearRegression # import regression class
from sklearn.metrics import r2_score,mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit (X_train,y_train)


# In[25]:


y_pred = lin_reg.predict(X_test)


# In[26]:


r2_score(y_test,y_pred)


# In[39]:


from math import sqrt


# In[42]:


rms_test=sqrt(mean_squared_error(y_test,y_pred))
rms_test


# In[29]:


y_pred1 = lin_reg.predict(X_train)


# In[30]:


r2_score(y_train,y_pred1)


# In[43]:


rms_train=sqrt(mean_squared_error(y_train,y_pred1))
rms_train
 

_23873

Girish
Alumni
Train RMSE( 96908 records) = 7.643677114482245
Test RMSE( 41532 records) = 7.659447667960773
Please find the attached notebook.
 

Attachments

  • Soccer-Players-2.ipynb.zip
    6.9 KB · Views: 3

Sumalatha Pampapathi

New Member
Alumni
Hi Jayanth,

When I try to use an sklearn function on the soccer data, I am getting the error below:
ValueError: could not convert string to float:

How do I overcome this error?


Regards,
Suma
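That error usually means a text column (such as preferred_foot or the work-rate columns) is still in the feature matrix. Two common fixes, sketched on a made-up slice of the player table:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical slice of the player table: one text column among numerics
df = pd.DataFrame({'potential': [71.0, 66.0, 74.0],
                   'preferred_foot': ['right', 'left', 'right'],
                   'crossing': [49.0, 48.0, 55.0]})

# Option 1: keep only the numeric columns
numeric = df.select_dtypes(include='number')
print(list(numeric.columns))  # ['potential', 'crossing']

# Option 2: encode the text column so everything is numeric
df['preferred_foot'] = LabelEncoder().fit_transform(df['preferred_foot'])
print(df.dtypes.apply(str).tolist())
```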
 

KALYAN DEY

Member
Alumni
*Thread locked for this batch's learners. Post your questions below.

I have completed the Data Science with R, Python & SAS batches and have also completed the Machine Learning course. Could you please send me a sample CV/resume for a data science profile with 6+ years of experience in software development?
 