Programming Basics and Data Analytics with Python | Anand | June 13 - July 18

Aparisim Saha

Active Member
Thank you Shashwat,

but I am still unable to understand the point below. I am from a non-programming background and am learning slowly.

1) Regarding the 'size_cleanser' error. While calling the user-defined function in the next executable line, you missed 'sizearg' which is mentioned in the error.

If possible, please share the first 6 lines of code with me.

Thanks in Advance.
You have to assign the result of the function to a new variable or back to the existing variable.
I am not from a programming background either, and I am struggling with the same thing.
 
Hi Anand,

I am stuck at fitting the training data to Linear Regression Model. I have tried everything possible but I keep getting the below error.

I have also tried without sample_weight. But the same error pops up. This is a stoppage not a hurdle. If I cannot pass this step, I cannot complete the project.

upload_2020-7-16_23-47-40.png

Please help.
 

_79419

Member
Dear Anand,

While applying get_dummies to the column Type, as shown in the attached screenshot, I am getting 0 for Paid and 1 for Free. Is there a way to reverse it? I mean, I need 1 for Paid and 0 for Free.

Regards,
Sathish Kumar
pnd2.png
 

anand.s.subramaniam

Well-Known Member
Alumni

Hi
If you want to set the mapping manually, then you do not need pd.get_dummies at all. So, no, pd.get_dummies does not allow you to override and change the mapping.
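
A minimal sketch of two ways to get 1 for Paid and 0 for Free. The data frame `df` and its values are made up for illustration, and the 'Type' column is assumed to contain only 'Free' and 'Paid'. Note that pd.get_dummies creates one indicator column per category, so a 'Type_Paid' column is already 1 for paid apps.

```python
import pandas as pd

# Toy data; the real 'Type' column is assumed to hold only 'Free' and 'Paid'.
df = pd.DataFrame({'Type': ['Free', 'Paid', 'Free']})

# Option 1: explicit mapping, no get_dummies needed.
df['Type_Paid'] = df['Type'].map({'Paid': 1, 'Free': 0})

# Option 2: get_dummies makes one indicator column per category;
# keep only the 'Type_Paid' indicator.
dummies = pd.get_dummies(df['Type'], prefix='Type')
paid_flag = dummies['Type_Paid'].astype(int)
```

Both options give the same 0/1 column here; which one to use is a matter of taste.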
 

anand.s.subramaniam

Well-Known Member
Alumni
Hi Anand,

I am stuck at fitting the training data to Linear Regression Model. I have tried everything possible but I keep getting the below error.

I have also tried without sample_weight. But the same error pops up. This is a stoppage not a hurdle. If I cannot pass this step, I cannot complete the project.

View attachment 10520

Please help.


The error clearly says that the fit function is missing the positional argument y. Please recheck how you did the train and test split. Share the entire code with me if you are not able to solve it yourself.
 

anand.s.subramaniam

Well-Known Member
Alumni
Hi Anand

Subject = Project Playstore

1. For treating size column I've found a simple solution , Please check if below code is applicable in our project
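s = dfplay1['Size']
# raw-string regex that also keeps decimals such as 1.5M; sizes in M are multiplied by 1000
dfplay1['Size'] = s.str.extract(r'(\d+\.?\d*)', expand=False).astype(float).\
mask(s.str.contains('M', na = False), lambda x: x * 1000)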

It worked Like a MAGIC !


2. To check "Reviews should not be more than installs as only those who installed can review the app."
I've used syntax as below and it worked - dfplay1[(dfplay1['Reviews']) > (dfplay1['Installs'])]

The question is: how can I replace the above values of Reviews with the value of Installs from the same index? I don't want to drop this many records.

Please help me out


Regards
Pinak



Pinak,
1) If it worked, please go ahead and use it. As I said, there are multiple ways of solving a problem. I have mine, you have yours :)

2) Extract Reviews and Installs into a separate data frame and apply the logic using .iloc.
 

Pinak Das

Member
Hi Pinak,

There are only 11 rows where the reviews are more than installs. So, you can drop them. But I did not drop them. I changed the values of reviews equivalent to the corresponding values of installs.
What I did:
1) I created a new column with 2 categories, 'correct' and 'incorrect'. If the value in 'Reviews' is less than or equal to the value in 'Installs', it would read 'correct' otherwise 'incorrect'.
2) Used a groupby method to change the values of reviews with installs.
3) I dropped the new column.
I used np.where. It worked, thanks Shashwat.
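
A minimal sketch of the np.where approach mentioned above, on made-up numbers. The data frame `df` is hypothetical; only the column names 'Reviews' and 'Installs' come from the thread.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has more reviews than installs.
df = pd.DataFrame({'Reviews': [50, 300, 10], 'Installs': [100, 200, 10]})

# Where Reviews exceeds Installs, take the Installs value; otherwise keep Reviews.
df['Reviews'] = np.where(df['Reviews'] > df['Installs'],
                         df['Installs'],
                         df['Reviews'])
```

No rows are dropped; only the offending Reviews values are capped at the Installs value of the same row.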
 
The error clearly says that the fit function is missing the positional argument y. Please recheck how you did the train and test split. Share the entire code with me if you are not able to solve it yourself.


Hi Anand,

I defined the Independent Variables using the below code:
x = inp2.iloc[:,1:8]
where, inp2 = New Data Frame after generating dummy variables
The columns, x contains are: Reviews, Installs, Type, Price, Reviews_Group, Conv_Size, Reviews_Log and Installs_Log

I defined the Dependent Variable using the below code:
y = inp2['Rating']

Then, I have used the below commands:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.30, random_state=10)
from sklearn.linear_model import LinearRegression
LinearRegression.fit(x_train, y_train, sample_weight=None)

I also tried LinearRegression.fit(x_train, y_train) but the error is same.


Please help.
 

anand.s.subramaniam

Well-Known Member
Alumni

I can only say what went wrong if I look at the code.

Meanwhile, can you try the same approach that I used in my Linear Regression notebook and let me know if it works?
 

_79419

Member
Dear Anand,

My RMSE value is 0.48. As you mentioned in class, when RMSE is less than one, this typically means that the model has good accuracy.

So I thought my model worked out fine.

But my lm score and r2 score are 0.125 (so far from 1). Can you suggest what I should do here?

One more query: when submitting the project, is there a threshold, so that the model must be x% accurate to satisfy the submission criteria?

Regards,
SathishKumar
 
Hi Anand,

I have a problem reading the file. I have been getting the error below and have been trying since yesterday. Could you please help? My computer crashed, so I could not practice anything these days.

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2

Regards
Aparna
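
The thread does not show the file, so the following is only a guess at one common cause of this error: the file starts with lines that contain no delimiter (notes, a title, or similar), and the real comma-separated table begins later. The sketch below builds such a file in memory (all contents are made up) and reads it by skipping the leading lines with `skiprows`. If the cause is a different delimiter instead, passing `sep=` to `pd.read_csv` would be the thing to try.

```python
import io
import pandas as pd

# Hypothetical file: 11 note lines without commas, then the actual table.
notes = "\n".join("note line %d" % i for i in range(1, 12))
raw = notes + "\nApp,Rating\nA,4.1\nB,3.9\n"

# Reading it as-is is expected to raise a ParserError like the one above,
# because the parser sees 1 field per line until the table starts.
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as exc:
    print("ParserError:", exc)

# Skipping the 11 leading lines lets pandas start at the real header.
df = pd.read_csv(io.StringIO(raw), skiprows=11)
```

Opening the file in a text editor and checking what is on the line named in the error message is usually the quickest way to see which case applies.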
 
Hi Anand,

I used your notes only to copy the commands.

Please see the attachment for the codes.


Hi Anand,

I thought just because I took a categorical (string) column in the dataset for independent variables, I was getting an error. Then, I simply took all the numerical columns but I am still getting the same error. I have clearly defined y.

Please check the attachment.
 

Attachments

  • LM Error1.pdf
    254.5 KB · Views: 9

_79494

Member
Error1.png Error2.png

Hello Raghavendra,

I'm unable to upload my project source code for final submission.
My attachment size is around 4MB, and the website says it allows attachments up to 10MB.
I've tried uploading zip files and embedding my .ipynb file into Word, PPT and PDF documents.
None of these things work.
Can you please let me know how I can submit my project?

Thanks,
Balachandar.
 

Hi Anand,

I am also facing the same issue
 

Hi Bala,

We have to submit our notebook converted into pdf in the screenshots section as well...right?
 
Hi Anand,

I thought just because I took a categorical (string) column in the dataset for independent variables, I was getting an error. Then, I simply took all the numerical columns but I am still getting the same error. I have clearly defined y.

Please check the attachment.


Kindly create an instance first.

lm = LinearRegression()
lm.fit(x_train,y_train)

Hope this will work
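
A minimal sketch of why the instance is needed, using synthetic data (the arrays `x` and `y` below are made up; the split parameters are the ones quoted earlier in the thread). Calling `LinearRegression.fit(x_train, y_train)` on the class itself means `x_train` is taken as `self` and `y_train` as `X`, so Python reports that `y` is missing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's features and target.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=10)

lm = LinearRegression()      # create an instance first
lm.fit(x_train, y_train)     # now x_train is X and y_train is y
score = lm.score(x_test, y_test)   # R2 on the held-out data
```

The `sample_weight=None` argument is optional and can simply be left out.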
 

anand.s.subramaniam

Well-Known Member
Alumni
Dear Anand,

My RMSE value is 0.48. As you mentioned in class, when RMSE is less than one, this typically means that the model has good accuracy.

So I thought my model worked out fine.

But my lm score and r2 score are 0.125 (so far from 1). Can you suggest what I should do here?

One more query: when submitting the project, is there a threshold, so that the model must be x% accurate to satisfy the submission criteria?


Regards,
SathishKumar



Hi Sathishkumar,
As you already know:

MSE = Sum( Square(actual value - predicted value) ) / total observations

RMSE = Sqrt(MSE)

R2 (the accuracy measure) = 1 - Error = 1 - Sum of squared errors / Total sum of squares
= 1 - Sum( Square(actual y - predicted y) ) / Sum( Square(actual y - mean of y) )


The general trend is that when RMSE is low, the R2 value is high.

But this is not always the case. There can be many cases where RMSE is low and R2 is also low.

The reasons for that could be the following:

1. The sample size.
2. The independent variables explaining the dependent variable are not enough and need to be increased.
3. The data is not cleaned well and still has a lot of missing values, outliers and duplicates, and a few other cases too.
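
A small worked example of the formulas above, on made-up ratings, showing that a low RMSE and a low R2 can occur together. RMSE is measured in the units of the target, so when the target itself varies little (as ratings clustered around 4 do), even a model that always predicts the mean has a small RMSE while its R2 is 0. The numbers and the "always predict the mean" model are assumptions for illustration only.

```python
import numpy as np

# Hypothetical ratings clustered in a narrow band.
actual = np.array([4.0, 4.2, 4.4, 4.1, 4.3, 4.5, 3.9, 4.6])
# A model that always predicts the mean rating.
predicted = np.full_like(actual, actual.mean())

mse = np.mean((actual - predicted) ** 2)
rmse = np.sqrt(mse)

sse = np.sum((actual - predicted) ** 2)        # sum of squared errors
sst = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
r2 = 1 - sse / sst
```

Here RMSE is well below 0.5 although the model explains none of the variation, which is why RMSE should be judged against the spread of the target rather than against a fixed value such as 1.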


What we need to understand is this:

Just because the R2 score is low, it does not mean that the model is entirely useless. Different businesses can have different thresholds for R2.

Here are some good links if you want more detail. This is an advanced topic and will need more time.

https://blog.minitab.com/blog/adven...ion-model-with-low-r-squared-and-low-p-values

https://www.researchgate.net/post/W...ient_in_the_same_time_And_why_may_this_happen



Also one more query while submitting project do we have any threshold that our model should be x% accurate only then satisfies the submission criteria kind off?
We do not have any threshold on accuracy. What is expected is that you have tried enough to make your model better. If you are able to show that in your work, I think that is enough.
 

anand.s.subramaniam

Well-Known Member
Alumni
Ideally, you should be able to submit. If it is difficult, can you upload your notebook to the Jupyter lab in Simplilearn and try to submit from there?

If that also does not work, please raise a ticket with Simplilearn.
 

_79494

Member
Hi Bala,

We have to submit our notebook converted into pdf in the screenshots section as well...right?

Hello Utkarsh,

I did not convert my Jupyter notebook to PDF. I embedded my Jupyter notebook into a Word document for the source code upload.
For the write-up, I think it does not matter whether it is a PDF or a .doc, so I made a PDF.
For screenshots, I only uploaded a few PNG images from my EDA and feature engineering, showing key areas that we learnt during our course and the results of the model prediction.

Hope this helps.

Regards,
Balachandar
 

_79494

Member
Hi Anand,

I am also facing the same issue
Hello Utkarsh,

Honestly, I don't know why. All this while I thought it was happening only to me.

However, I found a workaround that at least works for me.

-> My Jupyter notebook was originally 6MB. I first reduced the resolution of some of my EDA images, which shrank it to around 3MB, but this was still not enough.
-> Then I manually fixed the size of each image using 'figsize' and 'sns.set(font_scale=XXX)' (the default font scale is 1) so that the plots are clear and at the same time not too big. This shrank my .ipynb file to around 1.6MB.


Finally, I was able to upload this file (slightly less than 2MB) to the portal.

I faced this problem while submitting another project as well, for which I raised a ticket.
So, in short, for me any file larger than 3MB did not work (I tried different methods, such as zipping the Jupyter notebook and embedding the .zip version in a .doc file).

I did not want to take a chance with the Data Science project, as today would be our last class.

You could contact Simplilearn, or try the approach suggested by Anand in this thread (upload to the Jupyter labs and submit from there), or mine.
P.S.: I have not tried the labs approach, as the browser sometimes hangs when doing resource-heavy computation in the labs, and I'm more comfortable with an Anaconda installation on my own system.

Hope the answer helps and you are able to upload your project too.

Thanks,
Balachandar.
 

VAISHAK VINOD

New Member
  1. Average rating should be between 1 and 5 as only these values are allowed on the play store. Drop the rows that have a value outside this range.
    df[(df['Rating']<=0)&(df['Rating']>=5)].shape
  2. For free apps (type = “Free”), the price should not be >0. Drop any such rows.
    df[(df['Price']==0)&(df['Type']=='Free')].shape
Dear Sir,

Is the syntax I have written good enough to solve these two problems?
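
A minimal sketch of how the two checks could be written, on made-up rows (the data frame below is hypothetical; only the column names come from the post). A rating cannot be both `<= 0` and `>= 5` at once, so combining those two tests with `&` never matches any row; "outside the range" needs `|` (or). Likewise, the second condition as posted selects the valid free apps, whereas the rows to drop are free apps with a price above 0.

```python
import pandas as pd

# Toy data: row 1 has an impossible rating, row 3 is "Free" but priced.
df = pd.DataFrame({
    'Rating': [4.5, 19.0, 3.0, 0.5],
    'Type': ['Free', 'Free', 'Paid', 'Free'],
    'Price': [0.0, 0.0, 2.99, 1.99],
})

# Out of range means below 1 OR above 5.
bad_rating = (df['Rating'] < 1) | (df['Rating'] > 5)
# Free apps that nevertheless carry a price.
bad_price = (df['Type'] == 'Free') & (df['Price'] > 0)

# Keep only the rows that violate neither rule.
clean = df[~(bad_rating | bad_price)]
```

Checking `df[bad_rating].shape` and `df[bad_price].shape` before dropping shows how many rows each rule affects.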
 

_77176

Member
Hi Anand,

I am stuck at fitting the training data to Linear Regression Model. I have tried everything possible but I keep getting the below error.

I have also tried without sample_weight. But the same error pops up. This is a stoppage not a hurdle. If I cannot pass this step, I cannot complete the project.

View attachment 10520

Please help.
X=inp2[inp2.columns.drop('Rating')]. I hope you have dropped the Rating column. Also remove the 3rd parameter. It is best to look at how Sir did it in class. Do exactly that and it will go through.
 
Hello Anand,

below is my function and output for Size column.

My query: when I try to create a new column, I get an error. Please advise.

Function:
upload_2020-7-18_20-41-28.png

Output:
upload_2020-7-18_20-42-29.png

Error while creating a new column:
upload_2020-7-18_20-44-13.png

Thanks in Advance.
 
X=inp2[inp2.columns.drop('Rating')]. I hope you have dropped the Rating column. Also remove the 3rd parameter. It is best to look at how Sir did it in class. Do exactly that and it will go through.


Of course, I had dropped column 'Rating'. The problem was with y not with x. Anyways, it is resolved.
 
Hi all,

Looks like most of you are still facing the issue to fix the column 'Size'.

What I did:
1) I created a user-defined function to split 'M' and 'k' and stored the output in a new column.
2) Then, I replaced 'Varies with device' in the new column with null values.
3) Then, I replaced the Null values with the medians of 4 groups that I created for column 'Reviews'.

Below are the codes for the above 2 steps:
upload_2020-7-19_16-10-41.png
upload_2020-7-19_16-11-9.png
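
The screenshots are not reproduced in the text, so the following is only a sketch of how the three steps above might look, not the code from the post. The function name `size_to_kb`, the toy data, and the use of `pd.qcut` to build the Reviews groups are assumptions; the example uses 2 groups instead of 4 to keep the toy data small.

```python
import numpy as np
import pandas as pd

def size_to_kb(value):
    """Convert '19M' or '201k' to kilobytes; anything else becomes NaN."""
    if isinstance(value, str):
        if value.endswith('M'):
            return float(value[:-1]) * 1000
        if value.endswith('k'):
            return float(value[:-1])
    return np.nan   # e.g. 'Varies with device'

df = pd.DataFrame({
    'Size': ['19M', '201k', 'Varies with device',
             '2.5M', 'Varies with device', '800k'],
    'Reviews': [10, 20, 15, 5000, 6000, 7000],
})

# Steps 1 and 2: apply the function and store the result in a new column.
df['Conv_Size'] = df['Size'].apply(size_to_kb)

# Step 3: group by Reviews (2 quantile groups here) and fill nulls
# with the median size of each group.
df['Reviews_Group'] = pd.qcut(df['Reviews'], q=2, labels=False)
group_median = df.groupby('Reviews_Group')['Conv_Size'].transform('median')
df['Conv_Size'] = df['Conv_Size'].fillna(group_median)
```

Note that the function's return value has to be assigned to a column, as in `df['Conv_Size'] = ...`; calling it without assignment changes nothing.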


Hope this helps.

Thanks,
Shashwat Samrat Paul
 

Pinak Das

Member
Hi Guys

I'm getting a pretty low correlation between Rating and the other independent variables. Are you all seeing more or less the same? Please share your correlation chart.


upload_2020-7-19_22-51-52.png

Thanks
Pinak
 
Hi All,

I am not able to export the notebook as a PDF. I am getting "500 : Internal Server Error", even though I installed the Pconda. Has anyone else faced this issue? Can someone please help me solve it?

Thanks in advance,
Mahadev
 

Attachments

  • Untitled.png
    Untitled.png
    18.8 KB · Views: 14

_79124

New Member
Hi Anand

I submitted my project on Monday. I got an RMSE of 1.5e-15 and an r2 score of 1. I am doing Python and data analytics for the first time. Is it practically possible to get an r2 score of 1, or did something go wrong? :) I just applied all our class work step by step, as I don't have much experience in it.

Cherian Paul
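
The thread does not show this notebook, so the cause is unknown. One common reason for an R2 of exactly 1 with an RMSE near machine precision is that the target column was left among the features, so the model can copy it. The sketch below illustrates that effect on random, made-up data; it is an assumption about what may have happened, not a diagnosis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Random toy data: the features are unrelated to the target.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'Rating': rng.uniform(1, 5, size=50),
    'Reviews': rng.integers(1, 1000, size=50),
    'Price': rng.uniform(0, 10, size=50),
})

y = df['Rating']
leaky_x = df[['Rating', 'Reviews', 'Price']]   # target mistakenly kept in X
clean_x = df.drop(columns='Rating')            # target removed from X

leaky_r2 = LinearRegression().fit(leaky_x, y).score(leaky_x, y)
clean_r2 = LinearRegression().fit(clean_x, y).score(clean_x, y)
```

With the target included, R2 comes out as essentially 1; with it removed, R2 drops to a small value on this random data. Printing `x.columns` before fitting is a quick way to check for this.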
 

_77926

New Member
Hello Anand and friends
For Get dummy columns for Category, Genres, and Content Rating.
cat_cols = ['Category', 'Genres', 'Content Rating']
inp2 = pd.get_dummies(inp1, columns=cat_cols)
This gives me 166 columns !
Could you guide on what should be the correct code and expected no of columns

If one wants to generate corr() for only some columns Rating, Size, Price could someone share some hints
 

Attachments

  • pd_Dummies.png
    pd_Dummies.png
    18.4 KB · Views: 14

get_dummies will create n columns, where n is the number of unique values present in the column being encoded. If you use drop_first=True, it will create n-1 columns. It is not an error.
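
A minimal sketch of both points on made-up data: the column count with and without `drop_first=True`, and a correlation matrix restricted to a few columns. The data frame `inp1` below is a toy stand-in for the project's data, with only one categorical column.

```python
import pandas as pd

inp1 = pd.DataFrame({
    'Rating': [4.1, 3.9, 4.7, 4.3],
    'Size': [19000.0, 14000.0, 8700.0, 25000.0],
    'Price': [0.0, 0.0, 4.99, 1.99],
    'Category': ['GAME', 'TOOLS', 'GAME', 'FAMILY'],   # 3 unique values
})

# One indicator column per unique value: 3 numeric + 3 dummy columns.
full = pd.get_dummies(inp1, columns=['Category'])
# drop_first=True drops one dummy per encoded column: 3 numeric + 2 dummies.
reduced = pd.get_dummies(inp1, columns=['Category'], drop_first=True)

# Correlation for selected columns only: select them first, then call corr().
corr = inp1[['Rating', 'Size', 'Price']].corr()
```

So a large column count after get_dummies mostly reflects how many distinct values the encoded columns contain.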