# Machine Learning | Kaustubh Sakhare

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Aug 13, 2019.

Joined: Aug 1, 2018 · Messages: 248
#1
### Irene Boudarov (Active Member, Alumni)

Hi Kaustubh Sakhare,

My name is Irene Boudarov.

I reviewed the OSL Slides and the demo file LogisticRegression.py. Could you please explain in class what the plotting code in that file does? If you could review it line by line, that would be ideal, so that we can apply it to other assignments and projects with understanding:

```python
#==============================================================================
# let us visualize it
#==============================================================================

# Build a dense grid covering the feature plane from -5 to 5 (step 0.01),
# then flatten it into an (n_points, 2) array of (x1, x2) pairs
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]

# Ask the classifier for P(y = 1) at every grid point and reshape the
# probabilities back into the grid's 2-D shape
probs = logRegClassifier.predict_proba(grid)[:, 1].reshape(xx.shape)

print(probs)

# Filled contour plot of the probabilities: the model's decision surface,
# drawn with 25 levels on a red-to-blue colormap
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                      vmin=0, vmax=1)

# Colorbar mapping color to probability, labeled and ticked from 0 to 1
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

# Overlay the test points on the surface, colored by their true class
ax.scatter(X_test[:, 0], X_test[:, 1], c=(y_test == 1), s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2,
           edgecolor="white", linewidth=1)

# Square axes with matching limits and labeled feature axes
ax.set(aspect="equal",
       xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")

#==============================================================================
# So now let us visualize the Test set
#==============================================================================
plt.show()
```

#------------------------------------------

thank you so much in advance,
Irene

#2
### Irene Boudarov (Active Member, Alumni)

Hi Kaustubh,

This is Irene again.

I was doing one of the ML projects and trying to split the dataset into train (80%) and test (20%). After the split, I checked whether it was done accurately, and instead of 80/20 it appeared to split 75/25. I spent a long time debugging and could not figure out why the split was not the 80/20 I asked for. Here is my code:
```python
#### split housing dataset into train and test datasets: train=80% and test=20% ####
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# check that the train and test datasets were split correctly
print("X_train size: ", X_train.shape, "; y_train size: ", y_train.shape)
print("X_test size: ", X_test.shape, "; y_test size: ", y_test.shape)
```

Output:

```
X_train size: (16512, 10) ; y_train size: (16512,)
X_test size: (4128, 10) ; y_test size: (4128,)
```

```python
4128/16512
```

Output: `0.25`

As you can see above, the test set is 25% of the data, not the 20% I asked for. Please let me know if there is anything wrong with this code and why the train/test split does not come out correctly.

BTW, what is the difference between the following lines when splitting, and which one should be used in which scenario?

```python
# Option 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Option 2
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=2)
```

thank you
Irene

#3
### Irene Boudarov (Active Member, Alumni)


Hi Kaustubh,
I just realized that I was calculating the train/test percentages incorrectly. I should have divided the test set size by the size of the overall dataset:

```python
4128/20640
```

Output: `0.20`

My second question still holds: what is the difference between the following lines when splitting, and which one should be used in which scenario?

```python
# Option 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Option 2
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=2)
```
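If it helps to see them side by side: both options call the same function, and the only difference is the container passed in. Option 1 passes a NumPy feature matrix and gets NumPy arrays back; Option 2 passes a pandas DataFrame and gets DataFrame slices back, with column names preserved. A minimal sketch with synthetic data (the shapes mimic the housing dataset, but the values are made up, and I am assuming `df` holds the same columns as `X`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features: 20640 rows, 10 columns
rng = np.random.RandomState(2)
X = rng.rand(20640, 10)              # NumPy feature matrix (Option 1)
df = pd.DataFrame(X)                 # the same data as a DataFrame (Option 2)
y = rng.rand(20640)

# Option 1: NumPy in -> NumPy train/test arrays out
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Option 2: DataFrame in -> DataFrame train/test slices out
df_train, df_test, y_train2, y_test2 = train_test_split(
    df, y, test_size=0.2, random_state=2)

# The same rows land in each split either way; only the container differs
print(X_test.shape, df_test.shape)   # (4128, 10) (4128, 10)
print(X_test.shape[0] / X.shape[0])  # 0.2
```

So for the housing project either option works; passing the DataFrame is convenient when you want to keep column names around for later inspection.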

thank you
Irene

#4
### Irene Boudarov (Active Member, Alumni)

Hi Kaustubh,

I have started working on one of the projects (Project 4: California Housing Price Prediction) and have a few questions about it. Could you please help clarify the points below?

• Question 3: There are two ways to encode categorical data. Which of them do you expect us to use in this project, or in other words, which would be the better option for the ML model?

1. Encoding the categorical data with LabelEncoder, transforming the ocean_proximity values to a 0–4 scale, OR

2. Adding an additional column for each ocean_proximity category and putting 1 in the columns that apply and 0 in those that don't. This method adds five additional columns: <1H OCEAN, INLAND, ISLAND, NEAR BAY, NEAR OCEAN.

In the MovieLens project (from Data Science) I added additional columns to split the Genres categories before applying ML modeling.
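For what it's worth, a minimal sketch of the two encodings on a toy column (the category names are the real ocean_proximity values, but the rows are made up). LabelEncoder produces a single integer column, which implies an ordering the categories do not actually have; one-hot encoding (here via `pd.get_dummies`) adds one 0/1 column per category and is usually the safer choice for regression models:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the ocean_proximity column
s = pd.Series(["<1H OCEAN", "INLAND", "ISLAND",
               "NEAR BAY", "NEAR OCEAN", "INLAND"])

# Option 1: label encoding -> one integer column (implies 0 < 1 < 2 < 3 < 4)
codes = LabelEncoder().fit_transform(s)
print(codes.tolist())                 # [0, 1, 2, 3, 4, 1]

# Option 2: one-hot encoding -> one 0/1 column per category (5 new columns)
dummies = pd.get_dummies(s, prefix="ocean_proximity")
print(dummies.shape)                  # (6, 5)
print(dummies.columns.tolist())
```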
• Question 5: Standardize the data: would this be done using StandardScaler?
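On Question 5: StandardScaler is the usual tool for this; the one pattern to watch is fitting it on the training set only and reusing those statistics on the test set. A toy sketch (the numbers here are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test features: two columns on very different scales
X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit on the training set only
X_test_std = scaler.transform(X_test)        # reuse the train mean/std

print(X_train_std.mean(axis=0))              # close to [0, 0]
print(X_train_std.std(axis=0))               # close to [1, 1]
```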

• Question 6: Perform Linear Regression:

1. Predict output for the test dataset using the fitted model:

   1. What is the meaning of "fitted model"? Are you referring to the X_test data after standardizing it with StandardScaler?
   2. Should I perform Linear Regression on the test data using one feature or all features? House price is the target and the other columns are features; since many features affect the house price, should I use all of them or only one for the Linear Regression prediction?
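On the "fitted model" point: my understanding is that it refers to the LinearRegression object after calling `.fit(X_train, y_train)`, not to the scaled X_test, and that all features normally go into the regression, not just one. A minimal sketch on synthetic data (the shapes, coefficients, and noise level here are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the (scaled) housing features
rng = np.random.RandomState(0)
X = rng.rand(200, 4)                          # use ALL features, not just one
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                   # "fitted model" = model after .fit()
y_pred = model.predict(X_test)                # predict on the test set

print(mean_squared_error(y_test, y_pred))
```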
• EDA and Feature Engineering: I noticed that this project does not ask for Exploratory Data Analysis (EDA) to determine which features affect the output. Shouldn't we do this before splitting the datasets and before the machine learning process? Also, what about feature engineering in this project? Should we do that before splitting the data?

• Visualization: I also noticed that the project asks for very little visualization. Are you not expecting us to visualize the data before applying machine learning algorithms? I thought good practice is to first understand the data (by visualizing relationships between features and the target) before applying the ML model. Please confirm whether any visualization is required for this project to see correlations among features and between the features and the target.

thank you so much in advance,
Irene

#5
### Irene Boudarov (Active Member, Alumni)

Hi Kaustubh,

There are two ways to replace missing values: 1) with fillna, or 2) with Imputer.

Here is my code below, and I got an error:

```python
# import Housing Dataset in a DataFrame

# summarize all the isnull values
np.sum(housing.isnull())
```

Output:

```
longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
median_house_value      0
dtype: int64
```

```python
# impute the missing values with 'mean'
from sklearn.preprocessing import Imputer
mean_imputer = Imputer(missing_values=np.nan, strategy='mean', axis=1)
mean_imputer = mean_imputer.fit(housing)
imputed_housing = mean_imputer.transform(housing.values)
housing = pd.DataFrame(data=imputed_housing, column = cols)
```

Output: I got a long error (please see the screenshot attached). I got one error when using `housing.values` and a different one when using `housing['total_bedrooms'].values`; my attached screenshots show the errors I got when using Imputer to replace the missing values.

In addition, please confirm what the values of `cols` should be. Are they supposed to be all of the columns in the dataset, or only the column where the missing values are?
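In case it helps narrow the error down, two things stand out in the snippet above: `housing` still contains the string column `ocean_proximity` when it is handed to Imputer (taking a mean over strings fails), and `pd.DataFrame(..., column=cols)` should be `columns=cols`. Also, `Imputer` was deprecated in scikit-learn 0.20 and later removed in favour of `sklearn.impute.SimpleImputer`. A minimal sketch of both approaches on a toy frame (the values here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame mimicking the problem: a numeric column with a NaN plus a text column
housing = pd.DataFrame({
    "total_bedrooms": [2.0, np.nan, 4.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

# 1) fillna on just the column that has missing values
filled = housing["total_bedrooms"].fillna(housing["total_bedrooms"].mean())
print(filled.tolist())                            # [2.0, 3.0, 4.0]

# 2) SimpleImputer, restricted to the numeric column(s) only
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
housing[["total_bedrooms"]] = imputer.fit_transform(housing[["total_bedrooms"]])
print(housing["total_bedrooms"].tolist())         # [2.0, 3.0, 4.0]
```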

thank you
Irene

#6
