Welcome to the Simplilearn Community


Programming Basics and Data Analytics with Python | Deepak | 25th Jul - 29th Aug

Hi,
I did it as follows; if it works for you, great:

# Cleaning Categories into integers

CategoryString = inp_1["Category"]
categoryVal = inp_1["Category"].unique()
categoryValCount = len(categoryVal)
category_dict = {}
for i in range(categoryValCount):
    category_dict[categoryVal[i]] = i
inp_1["Category"] = inp_1["Category"].map(category_dict).astype(float)

# Converting content rating to integers

RatingL = inp_1['Content Rating'].unique()
RatingDict = {}
for i in range(len(RatingL)):
    RatingDict[RatingL[i]] = i
inp_1['Content Rating'] = inp_1['Content Rating'].map(RatingDict).astype(int)

# Converting genres to integers

GenresL = inp_1['Genres'].unique()
GenresDict = {}
for i in range(len(GenresL)):
    GenresDict[GenresL[i]] = i
inp_1['Genres'] = inp_1['Genres'].map(GenresDict).astype(int)

First do the task above, then copy the dataframe and drop the columns you don't need.

Try it at your end; mine worked very well.
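The three unique()/dict loops above all do the same thing; as a sketch (column names taken from the thread, data invented for illustration), `pd.factorize` can replace each hand-built dictionary in one call:

```python
import pandas as pd

# Hypothetical sample with the same column names as the thread
inp_1 = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME", "FAMILY"],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Everyone"],
    "Genres": ["Arcade", "Tools", "Action", "Casual"],
})

# factorize assigns 0, 1, 2, ... in order of first appearance,
# exactly like the manual unique()/dict loop
for col in ["Category", "Content Rating", "Genres"]:
    inp_1[col] = pd.factorize(inp_1[col])[0]

print(inp_1["Category"].tolist())  # → [0, 1, 0, 2]
```

This also avoids the easy-to-make bug of indexing the dictionary with the whole array instead of one element.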


Thanks so much for your help Shrikant. It worked for me as well.
 

Maja Nikolova

Active Member
Hi Shrikant,

does it mean that we do not have to create dummy variables, but can just convert these columns to float/integer and then put these 3 variables into the model?

Thanks,
Maja




You have to create the dummy dataframe. But before creating the dummies, first convert the columns to float/integer, then create the dummies.
When I created the dummies first and converted to float/integer later, the Category, Genres, and Content Rating columns gave an error. So I ran the conversion first and then created the dummies.
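A minimal sketch of that order of operations (column names follow the thread; the data is made up):

```python
import pandas as pd

# Hypothetical frame mimicking the thread's columns
inp_1 = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Rating": [4.5, 4.0, 3.8],
})

# 1) Convert the label column to integer codes first ...
inp_1["Category"] = pd.factorize(inp_1["Category"])[0]

# 2) ... then build the dummy (one-hot) dataframe from it
dummies = pd.get_dummies(inp_1["Category"], prefix="Category")
inp_2 = pd.concat([inp_1.drop(columns="Category"), dummies], axis=1)

print(list(inp_2.columns))  # → ['Rating', 'Category_0', 'Category_1']
```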
 
Hey All,
I'm new to programming. Can anyone help me complete my project on time?

Can I have the steps, i.e. which condition to apply for each point?

Thanks in advance.

Regards,
Nishita.
 

Maja Nikolova

Active Member
Hi All,

Point 11 (Model building) from the project: when building the model,

after calculating the intercept and the coefficients, which value did you take as X? (y = b0 + b1*x)
Based on what did you decide which value to take for X?


3.1. Decide a threshold as cutoff for outliers and drop records having values more than that.

As mentioned in the live class, I used the code below:

Q1 = df['Installs'].quantile(0.25)
Q3 = df['Installs'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Result: 4990000.0

Is this correct? What did you have as a threshold?

Thanks,
Maja
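One note: the IQR itself is not the cutoff; the conventional rule takes Q3 + 1.5*IQR as the upper fence and drops rows above it. A sketch with synthetic numbers (not the project data):

```python
import pandas as pd

# Synthetic Installs column; the real project numbers will differ
df = pd.DataFrame({"Installs": [1_000, 10_000, 100_000, 500_000, 1_000_000, 50_000_000]})

q1 = df["Installs"].quantile(0.25)
q3 = df["Installs"].quantile(0.75)
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr          # conventional outlier cutoff
df_clean = df[df["Installs"] <= upper_fence]

print(len(df_clean))  # → 5 (only the 50M row is dropped)
```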
 

Aman Dua

Member
Hi,

Can someone please tell me what should be done with "Varies with Device" in the Size column? I mean, what value should I replace it with: the mean of the sizes, the median, or some other value?

Thanks in advance!!
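One common approach (a sketch, assuming the Size column has already been converted to numeric with "Varies with Device" left as NaN) is to fill those entries with the median, which is more robust to the skew in app sizes than the mean:

```python
import pandas as pd
import numpy as np

# Hypothetical Size column in MB; NaN marks "Varies with Device"
df = pd.DataFrame({"Size": [19.0, 14.0, np.nan, 25.0, np.nan, 2.3]})

median_size = df["Size"].median()   # ignores NaN by default
df["Size"] = df["Size"].fillna(median_size)

print(median_size)  # → 16.5
```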
 

_85192

Member
Hi, how do you convert the notebook to PDF in order to upload and submit the project?
I am unable to install the pdconvert module.

This does not work for me: !pip install code2pdf

Is there any other way?
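One alternative (a sketch; "MyProject.ipynb" is a placeholder filename) is Jupyter's built-in `nbconvert`, which ships with standard Jupyter installs:

```shell
# Export the notebook to HTML (works without LaTeX), then print the
# HTML to PDF from the browser:
jupyter nbconvert --to html MyProject.ipynb

# Direct PDF export also works, but it requires a LaTeX distribution:
jupyter nbconvert --to pdf MyProject.ipynb
```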
 

Maja Nikolova

Active Member
Dear All,

I'd like to consult you about the following:

when I run the intercept and coef there are many values, probably one per variable.
How do you interpret that? I receive the following output:

lr.intercept_
lr.coef_
Out[74]:
array([[ 1.00000000e+00, 2.69462581e-16, -2.07480463e-16, ...,
8.63118525e-08, 8.63118519e-08, 8.63118520e-08],
[ 1.21307983e-14, 1.00000000e+00, 2.22044605e-16, ...,
1.75530638e-06, 1.75530638e-06, 1.75530638e-06],
[ 7.80777179e-16, 6.66133815e-16, 1.00000000e+00, ...,
8.57143901e-06, 8.57143901e-06, 8.57143901e-06],
...,
[-2.69232634e-17, -3.81639165e-17, 1.38777878e-17, ...,
8.33334271e-01, -1.66665729e-01, -1.66665729e-01],
[ 1.33218416e-17, 7.22349697e-18, -5.90890184e-18, ...,
-1.66666330e-01, 8.33333670e-01, -1.66666330e-01],
[ 1.67992341e-17, 6.83216775e-18, -6.82877962e-18, ...,
-1.66666625e-01, -1.66666625e-01, 8.33333375e-01]])


Also, what predictions do you receive?

Please share.

Many thanks in advance,
Maja
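There is one coefficient per feature column (and `lr.coef_` becomes 2-D when y has more than one column, which is what the matrix above suggests). A sketch pairing each coefficient with a feature name, on tiny synthetic data with a fitted scikit-learn LinearRegression (the names "Reviews"/"Installs" are just placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic example where the true model is y = 2*x1 + 3*x2 + 1
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

lr = LinearRegression().fit(X, y)

# One coefficient per feature; pairing them with names makes them readable
for name, coef in zip(["Reviews", "Installs"], lr.coef_):
    print(name, round(coef, 3))
print("intercept", round(lr.intercept_, 3))
```

If `lr.coef_` is a matrix rather than a vector, the y passed to `fit` had multiple columns; for this project y should be a single target column.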
 

Ramya_79

New Member
Hi Aayushi,

I got stuck on the following topics; could you please help me out with them?

Sanity checks:


1. Average rating should be between 1 and 5, as only these values are allowed on the Play Store. Drop the rows that have a value outside this range.
2. Reviews should not be more than installs, as only those who installed can review the app. If there are any such records, drop them.
3. For free apps (type = "Free"), the price should not be > 0. Drop any such rows.

Thanks in advance. Waiting for your reply.

Hi Shrikant,
Were you able to solve this? I am stuck in these steps too. Will you be able to help?

Ramya
 

Hi Ramya,

I also got stuck on the sanity checks. Here is what I did; check whether it works for you, because for me it worked absolutely fine.

5.1 Checking avg. rating between 1 and 5
1. I converted Rating to float:
df_1["Rating"] = df_1["Rating"].astype(float)
2. I checked the unique values: df_1["Rating"].unique()
- the reason for doing this is that it gives the exact details about the ratings.
3. I checked the same unique values on the original data too, for the same purpose.
4. On comparing them I found that the higher ratings had already been dropped while removing missing & duplicate values. You can still check by label with df.loc[] or df.iloc[].
5. Check with:
rating_1 = df_1[df_1["Rating"] > 5].index
rating_1
OR
df_1.drop(df_1[df_1["Rating"] > 5].index, inplace=True)  # if any ratings above 5 still exist, they are dropped

5.2 Checking reviews more than installs
1. Find the indices where reviews exceed installs:
review_install = df_1[df_1["Reviews"] > df_1["Installs"]].index
review_install
2. Drop those rows (from step 5.2.1):
df_1.drop(review_install, inplace=True)

But before running the code above, make sure the respective columns are float/int.

5.3 Checking Type(Free) = Price(0)
# Check the value counts of Type: df_1["Type"].value_counts()
1. Convert Type to binary as follows:

def type_cat(types):
    if types == 'Free':
        return 0
    else:
        return 1

# Mapping type_cat to Type
df_1["Type"] = df_1["Type"].map(type_cat).astype(float)

2. Then check that no free app (Type == 0 now) has a price greater than 0:

(df_1[df_1["Type"] == 0]["Price"] > 0).any()

Output: False means no free app has a non-zero price.


Hope it will work for you.
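The three checks can also be condensed into one short sketch (assuming the columns are already numeric; df_1 and the column names follow the thread, the rows are invented so that one row violates each rule):

```python
import pandas as pd

# Hypothetical data; exactly one row violates each sanity rule
df_1 = pd.DataFrame({
    "Rating":   [4.5, 19.0, 3.0, 4.0],
    "Reviews":  [100, 10, 5_000, 20],
    "Installs": [1_000, 50, 100, 500],
    "Type":     ["Free", "Free", "Paid", "Free"],
    "Price":    [0.0, 0.0, 4.99, 2.99],
})

# 1) Ratings must lie in [1, 5]
df_1 = df_1[df_1["Rating"].between(1, 5)]
# 2) Reviews cannot exceed installs
df_1 = df_1[df_1["Reviews"] <= df_1["Installs"]]
# 3) Free apps cannot have a price > 0
df_1 = df_1[~((df_1["Type"] == "Free") & (df_1["Price"] > 0))]

print(len(df_1))  # → 1 (only the first row passes all three checks)
```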
 
Hi Shrikant and Aayushi,

I am getting the errors below while processing the data, especially with the conversions to numeric and the removal of the extra characters. I am not sure what is going wrong.




Deepa Mukherjee

New Member
Can someone tell me why TAX and RAD are highly correlated in this case?
Hi, as far as I know, RAD is not actual measured data: RAD is an index value (if you look at its description, it is the index of accessibility to radial highways). So this correlation should not be treated as a true correlation.
Also, correlation should be checked on continuous data, and RAD is not continuous.

I hope this helps
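A quick way to see why the number looks high even though RAD is only an index (a sketch with synthetic TAX/RAD-like columns, not the actual dataset):

```python
import pandas as pd

# Synthetic stand-ins: RAD is an ordinal index, TAX a continuous dollar value
df = pd.DataFrame({
    "TAX": [296, 242, 222, 311, 666, 666],
    "RAD": [1, 2, 3, 5, 24, 24],
})

# Pearson correlation treats RAD as if it were continuous, so a cluster of
# high-RAD/high-TAX rows is enough to produce a near-1 figure
print(round(df.corr().loc["TAX", "RAD"], 2))
```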
 