Discussion in 'Big Data and Analytics' started by Nishant_Singh, Apr 25, 2020.
df.dropna(inplace=True) is the correct syntax; it worked for me.
Awesome, Thanks Sakthi
Do you have any solution for this now?
Hi - no they never did. I got them to go down a point haha.
I tried imputing with the median for all the NA rows that I could (some rows were still dropped if they had NAs in multiple columns). One thing I didn't try towards the end was dropping some of the numerical columns from my dataset when building x_data for the model. I didn't do that because I simply didn't see any good correlations at all (all 10% or less). Since we got different numbers (i.e. your R² score being double mine), I do wonder what we did differently.
In the Size column, what should we do with the "Varies with device" values?
Impute with the median of the column.
Impute those rows with the median of the column; that's what I tried, and that's what he said in yesterday's class too.
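A minimal sketch of that imputation, assuming the column is named `Size` and the sentinel string is exactly "Varies with device" (the toy values below are illustrative, not from the real CSV):

```python
import numpy as np
import pandas as pd

# Toy data; the real column comes from the Play Store CSV
df = pd.DataFrame({"Size": ["19000", "14000", "Varies with device", "25000"]})

# Turn the sentinel string into NaN, then cast the rest to float
df["Size"] = df["Size"].replace("Varies with device", np.nan).astype(float)

# Impute the missing entries with the column median (median() skips NaN)
df["Size"] = df["Size"].fillna(df["Size"].median())
```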
I'm stuck at step 11 (linear regression) of the App Rating project, where Python throws an error.
ValueError: could not convert string to float: 'TOOLS'
What I could make of this error: when creating dummies for Category, Genres, and Content Rating, there are two columns for 'Tools' (one from Category and the other from Content Rating). I've also checked the data type; it is float, since I set dtype=float while creating the dummies. But I'm not able to get past this step. What am I missing?
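One way to rule out a name clash between dummy columns is to give each source column its own prefix when creating the dummies; a minimal sketch with toy values (the real data has many more categories):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["TOOLS", "GAME"],
    "Content Rating": ["Everyone", "Teen"],
})

# A prefix per source column makes every dummy name unique,
# e.g. "Category_TOOLS" can never collide with "ContentRating_TOOLS"
dummies = pd.get_dummies(df, columns=["Category", "Content Rating"],
                         prefix=["Category", "ContentRating"], dtype=float)
```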
Hello Anand Sir,
Have you uploaded the KNN notebook (.ipynb)? I am unable to locate it on Google Drive; I can see all the other files for the topics you have covered. If it is uploaded, where can I find it? If not, kindly upload it as soon as possible. Also, for how long will we be able to look at your shared files for this class/session?
Please note the live class recording dated 24-05-2020 has no audio. Kindly upload a fresh recording at the earliest. Thanks.
Hi Sakthi,
Yes, we get around 160-167 columns, but a lot of them have NaN values, so how should I proceed? Should I drop them or impute those NaNs with their respective medians? Also, I concatenated inp1 and inp2, so do I need to drop any columns?
Hi Anand,
I am reviewing the last lecture (session 8), but there is no audio. Can you please check?
Hi Sakthi,
When I tried to compute the R² score, the system threw an error. Did you import any package for this?
Below is what i did :
#from sklearn.metrics import r2_score
Hi Sakthi,
My R² scores are the same, and I feel they are very low; I am not sure whether they are correct. Apart from this, I identified the dependent and independent variables below. What is the whole point of having dummy columns here? Also, can you give any input on how to identify the independent variables, at least when there are so many?
### DEFINE THE INDEPENDENT VARIABLES
### y = mx + b
X = inp1[['Reviews', 'Size','Price']]
### DEFINE THE DEPENDENT VARIABLE
y = inp1['Rating']
There is no audio for the last class recording (Sunday the 24th). Please help. Thank you.
My r2_score(y_test,pred_y) is very low: 0.0518302688187805
I know I have to keep working on the data by removing outliers and independent variables with very high or low correlation, to increase the accuracy.
However, due to the lack of time, is it OK if I submit the project with this value?
I hope the get_dummies function won't create any NaN values. Have you dropped all NaN values before calling get_dummies? I got 156 columns. I did some column reduction after creating a box plot, merged some categories that had the same median into one category, ran get_dummies again on inp0, and ended up with around 126 columns.
Yes, I did the same package import too. Can you post the error you got? I will try to help.
Make 'Rating' the dependent variable y, and make all the other columns the independent variables X. Selecting only 3 columns will not get you a good R² score; otherwise, there is no point in creating dummies for the categorical columns.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# 70/30 train-test split
df_train, df_test = train_test_split(inp3, train_size=0.7, random_state=100)

# pop() removes "Rating" from the frame and returns it as the target
y_train = df_train.pop("Rating")
X_train = df_train
y_test = df_test.pop("Rating")
X_test = df_test

# Fit the model and score it on the held-out test set
lr = LinearRegression()
lr.fit(X_train, y_train)
r2_score(y_test, lr.predict(X_test))
Try this code. You can find it in the solution document for the practice project given in the course itself.
I submitted with a similar score. I was told that it isn't about the score but more about the process you followed to get there, and whether it is correct, so I think you will be OK.
One thing I recall from the process is checking for correlated data all along the way. The data simply was not correlating in this case, so I expect the final numbers are supposed to be low. When you look at other projects, like the bike project, you will see some high correlations there and different end results with the same process.
Hi Vijaya Kumar,
Thanks a lot for your response. My bad, no NaNs got created. After the dummies were created, I dropped the original Category, Genres, and Content_Rating columns, so in total I have around 139 columns. All the dummy columns have zeros and ones.
Thanks Vijaya Kumar. So X should contain all the columns as independent variables, which means X = inp1.iloc[:, 0:140] (139 columns) and y = inp1['Rating']. Can you please confirm this part? Calculating the R² score has no issues now.
One last question: we have a column called 'Type' which has the value FREE (strings), and this will impact things when we try to fit the regression model. Is it safe to drop it, or do we need to do anything else?
Hi, like others before me, I am stuck on project no. 4.1. I have tried converting the string to int or float, but that results in all entries turning into 140933, which cannot be correct. Could someone please indicate how to impute/index/split the values in the "Size" column? I don't remember being shown how to do this with a loaded CSV file. Thanks.
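For reference, one way to handle a column like "Size" is a small converter function; this is a sketch that assumes the sizes appear as strings like '19M' and '201k' plus a 'Varies with device' sentinel (the exact unit formats are an assumption about the data):

```python
import numpy as np
import pandas as pd

def size_to_kb(s):
    """Convert '19M' / '201k' style strings to KB; anything else becomes NaN."""
    if isinstance(s, str):
        if s.endswith("M"):
            return float(s[:-1]) * 1024  # megabytes -> kilobytes
        if s.endswith("k"):
            return float(s[:-1])
    return np.nan

sizes = pd.Series(["19M", "201k", "Varies with device"])
sizes_kb = sizes.map(size_to_kb)  # the NaNs can then be imputed with the median
```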
In which format should we upload the project in the LMS? I tried converting my Jupyter notebook to PDF, but it doesn't seem to work. I missed the last class, so I am not sure what was discussed about this.
As announced by you on the last day of the batch, i.e. on 24th May 2020, you said you would send all the participants a mail with instructions on how to post feedback on the social networking sites. Have you sent that mail? I am asking because I haven't seen any mail from you.
Hi Saurabh ,
Have you submitted the project? Any idea about the project deadline?
You can either drop the Type column or create dummies for it; both give the same results, so I dropped that column.
Yes you are right
When you pass those columns to pd.get_dummies(), the original Category, Genres, and Content_Rating columns are dropped from the table automatically, so there is no issue if you dropped them manually.
Akshay, your approach is correct. pd.get_dummies() will create one column for every value of a categorical column. We saw this in our logistic regression example.
If there is a column called gender with the values Male and Female, then after pd.get_dummies() you will see a separate column called gender_Male, with the value 1 for every row that has Male and 0 for every row that has Female.
Also, as seen in class, pd.get_dummies() has a parameter called drop_first, which can help reduce the number of columns created.
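The gender example above, sketched in code (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# Without drop_first: one dummy column per value
full = pd.get_dummies(df, columns=["gender"], dtype=int)

# With drop_first=True the first (alphabetical) level is dropped, so a
# single gender_Male column encodes both values: 1 = Male, 0 = Female
reduced = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)
```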
Sedric et al.,
To improve the accuracy of the models, one or all of the following approaches can be taken:
1. Data cleansing: remove missing values and any non-useful characters.
2. Impute missing values with the mean or median if numeric. Here, see whether you can do some localised imputation; refer to the methodology we used for imputing age in our logistic regression class. This will increase your score.
3. Build new columns. For example, I see the following opportunities to create new features (columns):
- Create a rating category column: high rating (>4 and <5), medium rating (>3 and <4), and low rating (>1 and <3).
4. You can build new features as categories around Size, Installs, and Price.
5. From the last-updated date you can derive some perspective, such as how recently the app was updated. The more recently an app is updated, the more it suggests the developers are listening to customers and improving their product at a faster rate. You can then check how that relates to the app's rating.
6. Once the above is done, you will end up with many more columns. You can then apply correlation to find out which of the numeric columns are closest to the dependent variable.
6.1. Here, you will need to drop all the categorical columns you converted to dummies; there is no point correlating them with your dependent variable.
7. Check for multicollinearity, i.e. which of your independent variables are highly correlated with each other. You can use the VIF method we learnt in class.
7.1. Identify the independent variables that have a high VIF.
7.2. Of those, choose the ones that have a high correlation with the dependent variable for your ML model.
8. Split the data into training and test sets.
9. Run different regression models. Even though we only tried one regression model, you all now have the code to try out others; just use the same code and see which models do well.
I am sure your R² score will improve after that.
Feroz, please check now; you should be able to get the audio.
Why does my Jupyter notebook get stuck whenever I call a seaborn function?
Thank you Vijaya Kumar.
I submitted the project on Monday, 25th May 2020; the deadline was 25th May 2020, end of day.
I have submitted the project.
Could you please let me know how long it would take to be evaluated?
Sir, the recording still has no audio. Please check and upload it again. Meanwhile, my project has been approved. Thank you for your guidance.
Sir, there is still no audio in last session
Still the same Anand. Kindly help !
Please check your code again. I only see the value "TOOLS" under the Category column; "Content Rating" does not have any value called "TOOLS". So, unless there is some error in your code, you will not get two columns for TOOLS.
Seaborn gets stuck only if you try to plot a graph with a lot of data points. Please share your code here, and I will see if I can help.
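If the data set is large, plotting a random sample usually unblocks the notebook; a minimal sketch (the DataFrame and column names here are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Reviews": np.arange(100_000),
                   "Rating": np.random.rand(100_000)})

# Plot a manageable random sample instead of every row
sample = df.sample(n=5000, random_state=42)
# import seaborn as sns
# sns.scatterplot(x="Reviews", y="Rating", data=sample)
```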
Feroz, yes, you are correct. I think we need to check back with Simplilearn on this; they should be able to re-upload it. Can you please raise a Simplilearn ticket?
It is still the same despite raising ticket(s).
Still, there is no audio in the last lecture's recording. Please look into it Anand Sir and Simplilearn Team
It's been a long time that I have been stuck at the point where I have to omit all the outliers
for the Price, Reviews, and Installs columns.
For Price, I found a number of outliers in the box plot and am unable to work out which values I need to omit.
My Price field summary is as follows:
Name: Price, dtype: float64
After doing an operation like
Price = playstore_app_analysis1.loc[(playstore_app_analysis1["Price"] >= 0) & (playstore_app_analysis1["Price"] < 200), ['Price']]
the box plot I get still has outstanding outliers.
So, please advise what the appropriate next step is in order to proceed further.
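A percentile-based cutoff is one common alternative to a hard-coded cap like 200; a minimal sketch with toy prices (the real values come from your Price column):

```python
import pandas as pd

prices = pd.DataFrame({"Price": [0.0, 0.0, 0.99, 2.99, 4.99, 9.99, 79.99, 399.99]})

# Keep rows at or below the 95th percentile instead of guessing a threshold
cap = prices["Price"].quantile(0.95)
trimmed = prices[prices["Price"] <= cap]
```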
It is still the same even after 10 days have passed. I hope we will get the recording with proper audio. Thanks.
I raised a ticket too; however, it is an issue at the WebEx end, so it might take some more time to get the recording. I read through Anand Sir's Jupyter notebooks on logistic regression and principal component analysis, appeared for the exam, and cleared it, plus my project got approved too.
Hi, I am getting the following error when I'm trying to create Linear Regression Model:
ValueError: array must not contain infs or NaNs
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df_train, df_test = train_test_split(inp1, test_size=0.3)

# replace() returns a copy unless assigned back (or inplace=True);
# my original snippet discarded the result, so nothing changed
df_train = df_train.replace([np.inf, -np.inf], np.nan)
df_test = df_test.replace([np.inf, -np.inf], np.nan)

# the NaNs still have to be dropped (or imputed) before fitting
df_train = df_train.dropna()
df_test = df_test.dropna()

X_train = df_train.drop('Rating', axis=1)
y_train = df_train['Rating']
X_test = df_test.drop('Rating', axis=1)
y_test = df_test['Rating']

lm = LinearRegression()
I had tried removing all the NaN and inf values, but without assigning the result of replace() back, nothing changed.