Machine Learning | Kaustubh Sakhare

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Aug 13, 2019.

  1. Nishant_Singh

    Nishant_Singh Well-Known Member
    Simplilearn Support

    Joined:
    Aug 1, 2018
    Messages:
    222
    Likes Received:
    30
    #1
  2. Irene Boudarov

    Irene Boudarov Active Member
    Alumni

    Joined:
    Jan 24, 2019
    Messages:
    22
    Likes Received:
    0
    Hi Kaustubh Sakhare,

    My name is Irene Boudarov.

    I reviewed the OSL slides and the demo file LogisticRegression.py. Could you please explain in class what the plotting code in that file does? Ideally, a line-by-line walkthrough would let us apply it to other assignments and projects with understanding:

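    #==============================================================================
    # let us visualize it
    #==============================================================================
    # Assumes numpy (np) and matplotlib.pyplot (plt) are imported, and that
    # logRegClassifier, X_test and y_test are defined earlier in the demo file.

    # Build a dense 2-D grid over [-5, 5) x [-5, 5) with step 0.01
    xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
    # Flatten both coordinate arrays and pair them up: one (x1, x2) row per grid point
    grid = np.c_[xx.ravel(), yy.ravel()]
    # Predicted P(y = 1) at every grid point, reshaped back to the grid's shape
    probs = logRegClassifier.predict_proba(grid)[:, 1].reshape(xx.shape)

    print(probs)

    # Filled contour plot of the probability surface using 25 color levels
    f, ax = plt.subplots(figsize=(8, 6))
    contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                          vmin=0, vmax=1)
    # Color bar that maps the colors to probability values
    ax_c = f.colorbar(contour)
    ax_c.set_label("$P(y = 1)$")
    ax_c.set_ticks([0, .25, .5, .75, 1])

    # Overlay the test points, colored by their true class
    ax.scatter(X_test[:, 0], X_test[:, 1], c=(y_test == 1), s=50,
               cmap="RdBu", vmin=-.2, vmax=1.2,
               edgecolor="white", linewidth=1)

    # Equal axis scaling, fixed limits, and axis labels
    ax.set(aspect="equal",
           xlim=(-5, 5), ylim=(-5, 5),
           xlabel="$X_1$", ylabel="$X_2$")

    #==============================================================================
    # So now let us visualize the Test set
    #==============================================================================
    plt.show()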

    #------------------------------------------
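To see what the first three lines of the plotting code do, here is a minimal sketch on a tiny grid. It is not the demo file itself: the weights `w` and `b` are made-up stand-ins for a trained classifier, and the two-column `proba` array only imitates the shape of what `predict_proba` returns (one column per class).

```python
import numpy as np

# A coarse grid so the arrays are small enough to inspect by eye.
# mgrid uses half-open ranges, so x takes 2 values and y takes 4 values.
xx, yy = np.mgrid[-1:1:1.0, -1:1:0.5]
print(xx.shape, yy.shape)  # both (2, 4)

# ravel() flattens each coordinate array; np.c_ pairs them column-wise,
# giving one (x1, x2) row per grid point.
grid = np.c_[xx.ravel(), yy.ravel()]
print(grid.shape)  # (8, 2)

# Hypothetical stand-in for logRegClassifier.predict_proba(grid):
# a logistic function with made-up weights (an assumption for illustration).
w = np.array([1.5, -0.5])
b = 0.0
p1 = 1.0 / (1.0 + np.exp(-(grid @ w + b)))
proba = np.column_stack([1.0 - p1, p1])  # columns: P(y = 0), P(y = 1)

# Take the P(y = 1) column and fold it back into the grid's 2-D shape,
# which is the layout contourf expects alongside xx and yy.
probs = proba[:, 1].reshape(xx.shape)
print(probs.shape)  # (2, 4)
```

With the real step of 0.01, the same code produces a 1000 x 1000 grid, and `contourf` colors each cell by its predicted probability.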

    thank you so much in advance,
    Irene
     
    #2
  3. Irene Boudarov

    Irene Boudarov Active Member
    Alumni

    Joined:
    Jan 24, 2019
    Messages:
    22
    Likes Received:
    0
    Hi Kaustubh,

    This is Irene again here.

    I was working on one of the ML projects and trying to split the dataset into train (80%) and test (20%) sets. After the split, I checked whether it was done accurately, and instead of 80/20 it appears to be 75/25. I spent a long time debugging and could not figure out why the split is not the 80/20 I asked for. Here is my code:
    -----------------------------
    #### split housing dataset into train and test datasets: train=80% and test=20% #####
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

    # check that the train and test dataset were split correctly
    print("X_train size: ", X_train.shape, "; y_train size: ", y_train.shape)
    print("X_test size: ", X_test.shape, "; y_test size: ", y_test.shape)

    Output:
    X_train size: (16512, 10) ; y_train size: (16512,)
    X_test size: (4128, 10) ; y_test size: (4128,)

    4128/16512
    Output: 0.25

    -------------------------------------------------

    As you can see above, the test data comes out at 25%, which is different from what I asked for (20%). Please let me know if there is anything wrong with this code and why the train/test sets are not splitting correctly.

    BTW, what is the difference between the following two lines when splitting, and which one should be used in which scenario?

    Option 1: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)
    Option 2: X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=2)

    thank you
    Irene
     
    #3
  4. Irene Boudarov

    Irene Boudarov Active Member
    Alumni

    Joined:
    Jan 24, 2019
    Messages:
    22
    Likes Received:
    0

    Hi Kaustubh,
    I just realized that I was calculating the train/test percentage wrong. I should have divided the test size by the overall dataset size:
    4128/20640
    Output: 0.20
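The two numbers can be checked side by side using the row counts printed earlier in the thread. The variable names below are only for illustration.

```python
# Row counts taken from the shapes printed in the earlier post.
n_train, n_test = 16512, 4128
n_total = n_train + n_test          # 20640 rows overall

# test_size is a fraction of the WHOLE dataset, so compare against n_total.
test_fraction = n_test / n_total    # 0.2

# Dividing test by train gives a different quantity: 0.2 / 0.8 = 0.25.
test_to_train = n_test / n_train    # 0.25

print(n_total, test_fraction, test_to_train)
```

So a 25% test-to-train ratio is exactly what an 80/20 split looks like.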

    My second question still holds: what is the difference between the following two lines when splitting, and which one should be used in which scenario?

    Option 1: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)
    Option 2: X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=2)
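The thread does not show how `X` and `df` were built, so the sketch below rests on an assumption: `df` is the full DataFrame including the target column, and `X` is `df` with the target dropped. Under that assumption, the only difference is which columns end up in the "features" half of the split. The rows are made up; only the column names follow the housing dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration; column names follow the housing dataset.
df = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 3.6, 3.1, 2.0, 2.6, 3.7],
    "housing_median_age": [41, 21, 52, 52, 52, 52, 52, 42, 52, 52],
    "median_house_value": [452600, 358500, 352100, 341300, 342200,
                           269700, 299200, 241400, 226700, 261100],
})
y = df["median_house_value"]                   # target
X = df.drop(columns=["median_house_value"])    # features only (assumed definition)

# Option 1: split features and target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Option 2: split the whole DataFrame and the target.
df_train, df_test, y_train2, y_test2 = train_test_split(
    df, y, test_size=0.2, random_state=2)

print(list(X_train.columns))    # no target column
print(list(df_train.columns))   # target column is still among the "features"
```

With the same `random_state` and the same number of rows, both options select the same rows. If `df` still contains the target, Option 2 would let a model see the answer as an input, so Option 1 is the safer choice for training. Option 2 can still be handy when you want to keep whole rows together, for example for inspection.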

    thank you
    Irene
     
    #4
  5. Irene Boudarov

    Irene Boudarov Active Member
    Alumni

    Joined:
    Jan 24, 2019
    Messages:
    22
    Likes Received:
    0
    Hi Kaustubh,

    I have started working on one of the projects (Project 4 - California Housing Price Prediction) and have some questions about it. Below is the list of questions; could you please help clarify them?


    • Question 3: There are two ways to encode categorical data. Which of them do you expect us to use in this project, or in other words, which would be the better option for the ML model?

      1. Encoding categorical data using LabelEncoder, transforming the ocean_proximity data to values on a 0, 1, 2, 3, 4 scale, OR

      2. Adding a column for each ocean_proximity category and putting 1 in the columns that apply and 0 in those that don't. This method adds 5 columns: 1hr ocean, inland, island, near bay, near ocean.
        In the MovieLens project (from Data Science) I added columns to split the Genres categories before applying ML modeling.
    • Question 5: Standardize data - would this be using StandardScaler?

    • Question 6: Perform Linear Regression:

      1. Predict output for test dataset using fitted model:

        1. What is the meaning of fitted model? Are you referring to X_test data after standardizing it using StandardScaler?
      2. Should I perform Linear Regression on the test data using one feature or all features? House price is the target and the other columns are features. Since many features affect the house price, should I use all of them or only one for the Linear Regression prediction?
    • EDA and Feature Engineering: I noticed that this project does not ask for Exploratory Data Analysis (EDA) to analyze which features affect the output. Shouldn't we do this before splitting the dataset and before the Machine Learning process? Also, what about Feature Engineering in this project? Should we do that before splitting the data?

    • Visualization: I also noticed that this project does not ask for much visualization. Are you not expecting us to visualize the data before applying machine learning algorithms? I thought it was good practice to first understand the data (by visualizing relationships between features and target) before applying the ML model. Please confirm whether any visualization is required for this project to see the relationship/correlation between features, and between the features and the target.
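Questions 3, 5 and 6 can be tried out together in one small sketch. This is not the project's reference solution: the data is randomly generated, the category labels are sample values, and the choices shown (one-hot encoding via `pd.get_dummies`, `StandardScaler`, all features in the regression) are common practice rather than confirmed project requirements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Randomly generated stand-in data (an assumption, not the real housing file).
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "housing_median_age": rng.integers(1, 52, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
df["median_house_value"] = 40000 * df["median_income"] + rng.normal(0, 10000, n)

# Question 3, option 1: LabelEncoder maps each category to an integer code.
# A linear model would read those codes as ordered magnitudes.
label_codes = LabelEncoder().fit_transform(df["ocean_proximity"])

# Question 3, option 2: one 0/1 column per category (one-hot encoding).
X = pd.get_dummies(df.drop(columns=["median_house_value"]),
                   columns=["ocean_proximity"], dtype=float)
y = df["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Question 5: fit the scaler on the training data only, then reuse it on test.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Question 6: the "fitted model" is the estimator after fit() has been called
# on the training data; it is then used to predict on the test features.
model = LinearRegression()
model.fit(X_train_std, y_train)
y_pred = model.predict(X_test_std)
print(X.columns.tolist())
print(y_pred.shape)
```

Here the regression uses every feature column. Restricting it to a single column, for example `X_train[["median_income"]]`, is a simple way to compare a one-feature model against the full one.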
    Your help is greatly appreciated,

    thank you so much in advance,
    Irene
     
    #5
  6. Irene Boudarov

    Irene Boudarov Active Member
    Alumni

    Joined:
    Jan 24, 2019
    Messages:
    22
    Likes Received:
    0
    Hi Kaustubh,

    There are two ways to replace the missing values: 1) with fillna; or, 2) with Imputer

    Here is my code below and I got an error:
    ------------------------------------------------------
    #import Housing Dataset in a DataFrame
    housing = pd.read_excel('/home/irit/AI_Assignments/MachineLearning/ProjectsSubmission/1553768847_housing.xlsx')

    #summarize all the isnull values
    np.sum(housing.isnull())
    OUTPUT:

    longitude 0
    latitude 0
    housing_median_age 0
    total_rooms 0
    total_bedrooms 207
    population 0
    households 0
    median_income 0
    ocean_proximity 0
    median_house_value 0
    dtype: int64

    # impute the missing values with 'mean'
    from sklearn.preprocessing import Imputer
    mean_imputer = Imputer(missing_values = np.nan, strategy='mean', axis=1)
    mean_imputer = mean_imputer.fit(housing)
    imputed_housing = mean_imputer.transform(housing.values)
    housing = pd.DataFrame(data=imputed_housing, column = cols)
    housing.head(2)


    OUTPUT: I got a long error (please see screenshot attached)


    I got two errors when using housing.values:
    upload_2019-8-14_14-47-52.png

    Here is the error I got when using housing['total_bedrooms'].values with Imputer (screenshot attached):
    upload_2019-8-14_14-42-26.png


    In addition, please confirm what the value of cols should be. Is it supposed to be all of the columns in the dataset, or only the column where the missing values are?
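The screenshots are not reproduced here, so the cause of the errors is an assumption. A likely culprit is that the DataFrame contains the text column ocean_proximity, which a mean imputer cannot convert to numbers. Two other details in the code above are worth checking: in the old `Imputer`, `axis=1` meant imputing along rows rather than columns, and the `pd.DataFrame` keyword is `columns=`, not `column=`. Also note that `Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22 in favor of `sklearn.impute.SimpleImputer`. The sketch below uses `SimpleImputer` on the numeric columns only, with made-up rows, and shows the `fillna` route next to it.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows for illustration; column names follow the housing dataset.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Way 1: pandas fillna with the column mean.
filled = housing.copy()
filled["total_bedrooms"] = filled["total_bedrooms"].fillna(
    filled["total_bedrooms"].mean())

# Way 2: SimpleImputer, applied to numeric columns only.
# Here cols lists the numeric columns passed to the imputer; listing just
# ["total_bedrooms"] would also work, since only that column has gaps.
cols = ["total_rooms", "total_bedrooms"]
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = housing.copy()
imputed[cols] = mean_imputer.fit_transform(housing[cols])

print(imputed)
print(imputed.isnull().sum())
```

Both routes fill the gap with the mean of the observed total_bedrooms values. The imputer has the advantage that, once fitted on the training data, the same means can be applied to the test data with `transform`.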

    thank you
    Irene
     
    #6
  7. Anurag Sharma_1

    Anurag Sharma_1 New Member
    Alumni

    Joined:
    Nov 28, 2015
    Messages:
    1
    Likes Received:
    0
    Hi Kaustubh,

    Can you please share Project #3, with the solution and other details, on Google Drive? We discussed this project yesterday (30-Aug-2019) in the last class of our Machine Learning course (8:00 pm to 11:00 pm).

    Thanks & regards
    Anurag Sharma
     
    #7
