Programming Basics and Data Analytics with Python | Deepak | 25th Jul - 29th Aug

Maja Nikolova

Active Member
Hi All,

Can somebody please help me with the following?

I am trying to create a boxplot of the "Price" field for the App Prediction Project, as required for the univariate analysis.

I used the same code as given in the course, but no plot is displayed:

import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot( y="Price", data=pd.melt(df))
plt.plot('Price')

I tried this option:
fig1, ax1 = plt.subplots()
ax1.set_title('Price')
ax1.boxplot('Price')

Moreover, the code for the histogram works: sns.distplot(df['Rating'],bins=100), but the code for the boxplot does not.

Can someone please help me?
I have googled a lot, reviewed the course, and used the same code, but the plot is still not displayed at all in Spyder (Anaconda).

Many thanks in advance!
Maja
 

Maja Nikolova

Active Member
Hi,
What error do you get when you run the boxplot?
Try the code below:
sns.boxplot(Dataframename['Price'])
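A note on the earlier attempts: pd.melt(df) without arguments reshapes the frame into 'variable' and 'value' columns, so it no longer has a column called Price, and ax1.boxplot('Price') passes the text 'Price' instead of the column's values. A minimal sketch of a version that passes the data itself, assuming the frame has a numeric Price column (the sample values are made up for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the project data; most apps are free.
df = pd.DataFrame({"Price": [0, 0, 0, 0, 0.99, 1.99, 2.99, 4.99, 399.99]})

fig, ax = plt.subplots()
# Pass the column values themselves, not the string 'Price'.
ax.boxplot(df["Price"].dropna())
ax.set_title("Price")
ax.set_ylabel("Price")
plt.show()  # may be needed when figures are not displayed automatically
```

In Spyder the figure may appear in the Plots pane rather than in the console, depending on the graphics backend setting.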


Thanks for the help!
Now the boxplot is displayed, but without the box. Could you please tell me whether it is displayed correctly? The boxplot does not look like the ones from the course; please see the screenshot below.
 

Attachments

  • boxplot.PNG
    boxplot.PNG
    6.7 KB · Views: 62
That is due to outliers in 'Price'. The difference between the minimum and maximum values in the Price column is very large, and most of the values are at or close to zero, so the box collapses into a line.
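To see this effect in numbers, here is a small sketch using only the standard library. The prices are invented for illustration (90 free apps and 10 paid ones) and are not taken from the project data:

```python
from statistics import quantiles

# Hypothetical prices: 90 free apps and 10 paid ones, one of them very expensive.
prices = [0.0] * 90 + [0.99, 1.99, 2.99, 4.99, 9.99,
                       14.99, 29.99, 79.99, 299.99, 400.0]

# The box of a boxplot spans the first to the third quartile.
q1, median, q3 = quantiles(prices, n=4)
print("Q1:", q1, "median:", median, "Q3:", q3, "max:", max(prices))
print("Box height (IQR):", q3 - q1)
```

With data like this, all three quartiles are zero, so the box has no height, while the axis has to stretch up to the largest price to show the outliers.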
 

_79969

Member
Hi,
What is wrong with the program below?
When I enter the integers 23, 43 and 9, I get 9 as the output and not 43.
Isn't the max function supposed to return the largest number?

#Python program to find the largest number among the three input numbers
print ("Enter 3 integers")
num = [input(),input(),input()]
print ("Maximum of ", num[0],num[1],num[2]," is :",max(num))
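A likely explanation: input() returns strings, and max() compares strings character by character, so "9" ranks above "43". A minimal sketch of the difference, with the three inputs hard-coded instead of typed in:

```python
# What input() would return for 23, 43 and 9: strings, not integers.
raw = ["23", "43", "9"]
print(max(raw))   # text comparison: "9" wins because '9' > '4' > '2'

# Convert to int first to compare by numeric value.
nums = [int(s) for s in raw]
print(max(nums))  # 43
```

In the original program this would mean wrapping each call, for example int(input()).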
 
Hello Aayushi Ma'am,
Where can we find the Python assignment you were going to put on the drive? I am unable to find the short assignment you were supposed to upload.
 

_85970

New Member
Hi Aayushi,

Find the attached Assignment 1 Answers.

Thanks,
Sathyanarayanan.C
 

Attachments

  • ASSIGNMENT 1_answers.zip
    2.6 KB · Views: 55

_86604

Member
Hi Aayushi,
Please find attached Assignment_Answer

Thanks,
Cilambarasan
 

Attachments

  • Assignment_Answer.zip
    3.5 KB · Views: 32

_85192

Member
Hi Aayushi,

For the assignment "divisible by 7 and not divisible by 5" I have used the following logic:

list=[x for x in range(2000,3200) for questionaet,remainder in enumerate(divmod(x,7)) if remainder ==0]
bridge=[x for x in list for quo,rem in enumerate(divmod(x,5)) if rem !=0]
print(bridge,end=" ")

I am not sure what I have done wrong. The output of bridge should be all numbers divisible by 7 but not divisible by 5. Instead, numbers divisible by 7 are printed twice, and some numbers divisible by 5 are printed as well; e.g. 2002 is printed twice, and 2065, which should not be printed at all, is printed too.

Output I get is as below:
[2002, 2002, 2009, 2009, 2016, 2016, 2023, 2023, 2030, 2037, 2037, 2044, 2044, 2051, 2051, 2058, 2058, 2065, 2072, 2072, 2079, 2079, 2086, 2086, 2093, 2093, 2100, 2107, 2107, 2114, 2114, 2121, 2121, 2128, 2128, 2135, 2142, 2142, 2149, 2149, 2156, 2156, 2163, 2163, 2170, 2177, 2177, 2184, 2184, 2191, 2191, 2198, 2198, 2205, 2212, 2212, 2219, 2219, 2226, 2226, 2233, 2233, 2240, 2247, 2247, 2254, 2254, 2261, 2261, 2268, 2268, 2275, 2282, 2282, 2289, 2289, 2296, 2296, 2303, 2303, 2310, 2317, 2317, 2324, 2324, 2331, 2331, 2338, 2338, 2345, 2352, 2352, 2359, 2359, 2366, 2366, 2373, 2373, 2380, 2387, 2387, 2394, 2394, 2401, 2401, 2408, 2408, 2415, 2422, 2422, 2429, 2429, 2436, 2436, 2443, 2443, 2450, 2457, 2457, 2464, 2464, 2471, 2471, 2478, 2478, 2485, 2492, 2492, 2499, 2499, 2506, 2506, 2513, 2513, 2520, 2527, 2527, 2534, 2534, 2541, 2541, 2548, 2548, 2555, 2562, 2562, 2569, 2569, 2576, 2576, 2583, 2583, 2590, 2597, 2597, 2604, 2604, 2611, 2611, 2618, 2618, 2625, 2632, 2632, 2639, 2639, 2646, 2646, 2653, 2653, 2660, 2667, 2667, 2674, 2674, 2681, 2681, 2688, 2688, 2695, 2702, 2702, 2709, 2709, 2716, 2716, 2723, 2723, 2730, 2737, 2737, 2744, 2744, 2751, 2751, 2758, 2758, 2765, 2772, 2772, 2779, 2779, 2786, 2786, 2793, 2793, 2800, 2807, 2807, 2814, 2814, 2821, 2821, 2828, 2828, 2835, 2842, 2842, 2849, 2849, 2856, 2856, 2863, 2863, 2870, 2877, 2877, 2884, 2884, 2891, 2891, 2898, 2898, 2905, 2912, 2912, 2919, 2919, 2926, 2926, 2933, 2933, 2940, 2947, 2947, 2954, 2954, 2961, 2961, 2968, 2968, 2975, 2982, 2982, 2989, 2989, 2996, 2996, 3003, 3003, 3010, 3017, 3017, 3024, 3024, 3031, 3031, 3038, 3038, 3045, 3052, 3052, 3059, 3059, 3066, 3066, 3073, 3073, 3080, 3087, 3087, 3094, 3094, 3101, 3101, 3108, 3108, 3115, 3122, 3122, 3129, 3129, 3136, 3136, 3143, 3143, 3150, 3157, 3157, 3164, 3164, 3171, 3171, 3178, 3178, 3185, 3192, 3192, 3199, 3199]

If I use a set, the duplicates are removed, but some numbers divisible by 5 are still printed.

e.g.
list=[x for x in range(2000,3200) for questionaet,remainder in enumerate(divmod(x,7)) if remainder ==0]
bridge=set([x for x in list for quo,rem in enumerate(divmod(x,5)) if rem !=0])
print(bridge,end=" ")
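What seems to happen here: enumerate(divmod(x, 5)) yields two pairs, (0, quotient) and (1, remainder), so the name rem is bound first to the quotient and then to the remainder. The quotient is never zero in this range, so every x passes once, and x passes a second time when its remainder is non-zero. A sketch of that behaviour and of a simpler filter using the % operator (same range as in the post):

```python
# divmod returns (quotient, remainder); enumerate adds an index to each.
print(list(enumerate(divmod(2002, 5))))  # [(0, 400), (1, 2)]
print(list(enumerate(divmod(2065, 5))))  # [(0, 413), (1, 0)]

# Test the remainders directly instead.
result = [x for x in range(2000, 3200) if x % 7 == 0 and x % 5 != 0]
print(result[:5])
```

Note that range(2000, 3200) stops at 3199; use range(2000, 3201) if 3200 is meant to be included.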
 

sangeetha.s(3501176)

New Member
Alumni
Hi, I am not able to access Jupyter Lab; I get a '503 service unavailable' error in the Firefox browser. Has anyone faced a similar issue? What is the solution?
 

Maja Nikolova

Active Member
Hi All,

Can you please help me with the following (the question is related to the App Rating Project)?

1. When I create a boxplot for the variables Rating and Content Rating, the attached visual is displayed.
How can I conclude from the boxplot whether there is any difference in the ratings?
Moreover, how can I conclude whether some types are liked better?

2. When I create a boxplot for Category and Rating, the attached boxplot is displayed.
How can I answer the question "Which genre has the best rating?" when I cannot read the labels on the x-axis because they overlap? I even rotated them vertically, but that did not help.

Can you please help me as soon as possible?

Many thanks in advance!
 

Attachments

  • Figure_1.png
    Figure_1.png
    12.7 KB · Views: 60
  • Figure_2.png
    Figure_2.png
    60.5 KB · Views: 56

Maja Nikolova

Active Member
Hi All,

Since there is no example of a log transformation anywhere in the course, I am facing some difficulties with the following:

  1. Reviews and Installs have some values that are still relatively very high. Before building a linear regression model, you need to reduce the skew. Apply log transformation (np.log1p) to Reviews and Installs.
I wrote the following code (2 options):

1.
import numpy as np
in_array = inp1['Reviews', 'Installs']
print ("Input array : ", in_array)
out_array = np.log1p(in_array)
print ("Output array : ", out_array)

I receive the following error: KeyError: ('Reviews', 'Installs')
upload_2020-8-9_19-38-1.png

2.
inp1['Reviews_norm'] = np.log1p(inp1['Reviews'])
print ("Output array : ", 'Reviews_norm')

I receive the following result:
upload_2020-8-9_19-39-32.png

Only two values appear to be log transformed. Why are not all of them transformed?

upload_2020-8-9_19-13-56.png


2.

inp1.drop(['App','Last Updated','Current Ver','Android Ver'],axis='columns', inplace = True)
 

Attachments

  • upload_2020-8-9_19-41-57.png
    upload_2020-8-9_19-41-57.png
    2.9 KB · Views: 56

Aayushi_6

Well-Known Member
Hi Maja,

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
for c in [c for c in df.columns if df[c].dtype in numerics]:
    df[c] = np.log1p(df[c])

Could you try it this way and see whether it works for you?
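On the two attempts in the question: inp1['Reviews', 'Installs'] looks for one column named by the tuple ('Reviews', 'Installs'), which explains the KeyError, and print("Output array : ", 'Reviews_norm') prints the text 'Reviews_norm' rather than the column. A minimal sketch that transforms only the two requested columns, assuming both are already numeric (the sample frame is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for inp1 with already numeric columns.
inp1 = pd.DataFrame({
    "Reviews": [0, 9, 99],
    "Installs": [10, 1000, 1000000],
    "Rating": [4.1, 3.9, 4.5],
})

# A list inside the brackets selects several columns at once.
cols = ["Reviews", "Installs"]
inp1[cols] = np.log1p(inp1[cols])

# Print the columns themselves, not their names as text.
print(inp1[cols])
```

Unlike the loop over all numeric columns, this leaves Rating untouched.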
 

Aayushi_6

Well-Known Member
Hi,


Box plots are useful because they provide a visual summary of the data, enabling researchers to quickly identify median values, the dispersion of the data set, and signs of skewness.

The shape of a box plot shows whether a data set is roughly symmetric or skewed.

bloxplots-skewed.jpg


When the median is in the middle of the box, and the whiskers are about the same length on both sides of the box, the distribution is symmetric.

When the median is closer to the bottom of the box, and the whisker is shorter on the lower end of the box, the distribution is positively skewed (skewed right).

When the median is closer to the top of the box, and the whisker is shorter on the upper end of the box, the distribution is negatively skewed (skewed left).

How to compare box plots


Box plots are a useful way to visualize differences among samples or groups. They provide a lot of statistical information, including medians, ranges, and outliers.


Step 1: Compare the medians of box plots


Compare the respective medians of each box plot. If the median line of a box plot lies outside of the box of a comparison box plot, then there is likely to be a difference between the two groups.

overlapping-box.jpg





Step 2: Compare the interquartile ranges and whiskers of box plots


Compare the interquartile ranges (that is, the box lengths) to examine how the data is dispersed within each sample. The longer the box, the more dispersed the data; the shorter the box, the less dispersed the data.

compare-boxplots.jpg


Next, look at the overall spread as shown by the extreme values at the end of two whiskers. This shows the range of scores (another type of dispersion). Larger ranges indicate wider distribution, that is, more scattered data.



Step 3: Look for potential outliers (see above image)


When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.



Step 4: Look for signs of skewness


If the data do not appear to be symmetric, does each sample show the same kind of asymmetry?

box-plots-distribution.jpg



Hope it helps!
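When the labels on the plot are hard to read, the numbers behind the boxes can also be compared directly. A minimal sketch, assuming a frame with Category and Rating columns (the values below are invented for illustration), that ranks the groups by median and reports the box length:

```python
import pandas as pd

# Hypothetical ratings for three categories.
df = pd.DataFrame({
    "Category": ["EVENTS"] * 3 + ["TOOLS"] * 3 + ["DATING"] * 3,
    "Rating": [4.6, 4.4, 4.5, 4.0, 4.2, 3.6, 3.9, 4.1, 3.5],
})

grouped = df.groupby("Category")["Rating"]
summary = pd.DataFrame({
    "median": grouped.median(),      # the line inside each box
    "q1": grouped.quantile(0.25),    # bottom edge of the box
    "q3": grouped.quantile(0.75),    # top edge of the box
})
summary["iqr"] = summary["q3"] - summary["q1"]  # box length

ranked = summary.sort_values("median", ascending=False)
print(ranked)
```

Ranking by median is only one way to read "best rating"; groups with very few apps can rank high by chance.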
 

Maja Nikolova

Active Member
Hi Aayushi,

thanks for the reply.

I tried the code, but I am not sure whether the data in the Reviews and Installs columns has actually been log transformed.
I am sending a screenshot. Are the values in the Reviews and Installs columns actually log transformed?

upload_2020-8-11_21-0-48.png


Many thanks!
 

Attachments

  • upload_2020-8-11_20-58-27.png
    upload_2020-8-11_20-58-27.png
    13.6 KB · Views: 49

Maja Nikolova

Active Member
Hi Aayushi,

thanks for the reply.

Yes, the interpretation is clear, but I am facing another issue: since there are many boxplots in the visual, the labels on the x-axis cannot be read; please see the screenshot below.
I rotated the labels vertically, but they still cannot be read, so I cannot interpret the visual correctly.
upload_2020-8-11_21-8-46.png


Any idea how this can be fixed?
 
Use the code below:

plt.figure(figsize=(25,16))

This increases the plot size so that all the labels are displayed properly.
 
Hi Aayushi,

Does this code work? I do see a change in the result, though.

df4['Reviews']=np.log1p(df4['Reviews'])
df4['Installs']=np.log1p(df4['Installs'])
 
Hi Aayushi,

I am looking for some help with the Python project.
It concerns question 4 of the data cleaning:
4. Variables seem to have incorrect type and inconsistent formatting. You need to fix them:
   1. Size column has sizes in Kb as well as Mb. To analyze, you'll need to convert these to numeric.
      1. Extract the numeric value from the column
      2. Multiply the value by 1,000, if size is mentioned in Mb


I am trying the following:
# replace any record that does not contain a size in Kb / Mb
df['Size'] = df['Size'].replace(r'[A-Z].*[a-z]$', '0', regex=True)
# strip the trailing M/K, then multiply by 1000 where the value was in M
df['Size'] = (df['Size'].replace(r'[mMkK]+$', '', regex=True).astype(float) * \
    df['Size'].str.extract(r'[\d\.]+([mM]+)', expand=False)
    .fillna(1).replace(['M'], [10**3]).astype(int))

but it seems the second statement did not work. Can you please help me?
Thanks
 

Aayushi_6

Well-Known Member
Hi,

Please increase the figure size to (12,12), (15,15) or (20,20).
 
Hi Aayushi ma'am,
I am getting this error for the Big_mart data set. Can you please tell me what the error is?



from scipy.stats import mode
Test['Item_Weight']=Test['Item_Weight'].fillna(mode(Test['Item_Weight'].mode[0])



File "<ipython-input-65-b95b67570e95>", line 2
Test['Item_Weight']=Test['Item_Weight'].fillna(mode(Test['Item_Weight'].mode[0])
^
SyntaxError: unexpected EOF while parsing
 
Hi Aayushi,

I got stuck on the following topics. Could you please help me with them?

Sanity checks:


1. Average rating should be between 1 and 5 as only these values are allowed on the play store. Drop the rows that have a value outside this range.
2. Reviews should not be more than installs as only those who installed can review the app. If there are any such records, drop them.
3. For free apps (type = “Free”), the price should not be >0. Drop any such rows.

And Thanks in advance. Waiting for your reply.
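One possible way to express the three checks is boolean filtering in pandas. This is a sketch under assumptions: the column names Rating, Reviews, Installs, Type and Price follow the task text, all numeric columns are already cleaned to numbers, and the sample rows are invented:

```python
import pandas as pd

# Hypothetical sample: row 0 is valid, rows 1 to 3 each break one rule.
df = pd.DataFrame({
    "Rating": [4.5, 19.0, 3.8, 4.1],
    "Reviews": [100, 50, 5000, 10],
    "Installs": [1000, 500, 1000, 100],
    "Type": ["Free", "Free", "Paid", "Free"],
    "Price": [0.0, 0.0, 2.99, 1.99],
})

# 1. Keep ratings between 1 and 5 (rows with a missing Rating are dropped too).
df = df[df["Rating"].between(1, 5)]

# 2. Keep rows where reviews do not exceed installs.
df = df[df["Reviews"] <= df["Installs"]]

# 3. Drop free apps that have a price above zero.
df = df[~((df["Type"] == "Free") & (df["Price"] > 0))]

print(df)
```

Each filter is assigned back to df; without the assignment the frame stays unchanged.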
 
Use this; it worked for me, and it may work for you:

# Size column reshaping
# scaling and cleaning size of installation

def change_size(size):
    if "M" in size:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    elif "k" in size:
        x = size[:-1]
        x = float(x)
        return(x)
    else:
        return None
 

_83451

Member
Hi all,
I used the code below to change the Size column, but when I checked afterwards I could not see any change in the data frame. Any idea how this can be corrected?

def reshape_size(size):
    if "M" in size:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    elif "k" in size:
        x = size[:-1]
        x = float(x)
        return(x)
    else:
        return None
 

_83451

Member
Hi Shrikant, I used the same approach, but the change is not reflected in the data frame after this operation. Any suggestions?
 

Aayushi_6

Well-Known Member
Hi Sushmita,

You have opened two parentheses but closed only one of them.
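Beyond the missing parenthesis, the line also writes .mode[0] without calling the method, which would fail once the syntax is fixed. A minimal sketch of one way to fill missing values with the most frequent value using the pandas mode() method alone, so the scipy import is not needed here (the sample weights are made up):

```python
import pandas as pd

# Hypothetical stand-in for the Big Mart test set.
Test = pd.DataFrame({"Item_Weight": [9.3, None, 9.3, 12.0, None]})

# Series.mode() returns a Series of the most frequent values; take the first.
most_common = Test["Item_Weight"].mode()[0]

# Fill the missing weights and assign the result back to the column.
Test["Item_Weight"] = Test["Item_Weight"].fillna(most_common)
print(Test)
```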
 

Aayushi_6

Well-Known Member
Thanks Shrikant for helping peers.
 

Maja Nikolova

Active Member
Hi Aayushi,

From the project:
4.3.1 Treat 1,000,000+ as 1,000,000: is any code needed here, or is it just for information? What exactly is required here?


5.1. Make boxplot for Ratings vs. Category. Which genre has the best ratings?

upload_2020-8-16_17-38-37.png

Most ratings overlap, i.e. the distributions of Rating across the different Content Rating categories overlap significantly, so Content Rating would not be a good predictor of price. Is my understanding correct?


How can I assess from the boxplot whether this variable (Category) can be a good predictor of Rating in the model? Is my interpretation below correct? (y = Rating, x = Category)

upload_2020-8-16_17-45-41.png


The genre Events has the best ratings. Does the boxplot that sits highest indicate the best rating? The distribution of Rating across the different Category values is distinct enough to take Category as a potentially good predictor of Rating. Is my understanding correct?


8.3. Get dummy columns for Category, Genres, and Content Rating. This needs to be done as the models do not understand categorical data, and all data should be numeric. Dummy encoding is one way to convert character fields to numeric. Name of dataframe should be inp2.

When the dummy columns are created for Category, there are 33 different dummy columns (as in the screenshot).
upload_2020-8-16_17-43-39.png

I tried combining the categories and using label encoding (as you said, and as stated in the preprocessing file, example 3):

upload_2020-8-16_20-1-39.png


but the result that I receive is not a dummy variable:

upload_2020-8-16_20-2-24.png


The same problem applies to the other two variables (Genres and Content Rating), which also have many categories.

Many thanks in advance!
 

Aayushi_6

Well-Known Member
Hi Aayushi,

From the project:
4.3.1 Treat 1,000,000+ as 1,000,000: is any code needed here, or is it just for information? What exactly is required here?

You need to remove the + sign from the numerical value with Python code, so that the value is treated as an integer and not as a string.
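The step described above can be sketched with plain string methods. The sample values are illustrative, and removing the thousands separators is an added assumption, since int() cannot parse commas:

```python
# Hypothetical raw values as they might appear in the Installs column.
raw = ["1,000,000+", "500+", "10,000+"]

# Strip the thousands separators and the trailing plus sign, then convert.
cleaned = [int(v.replace(",", "").replace("+", "")) for v in raw]
print(cleaned)
```

On a DataFrame column the same idea is usually written with the .str.replace accessor followed by .astype(int).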
5.1. Make boxplot for Ratings vs. Category. Which genre has the best ratings?

View attachment 11050

Most ratings overlap, i.e. the distributions of Rating across the different Content Rating categories overlap significantly, so Content Rating would not be a good predictor of price. Is my understanding correct?

Yes.

How can I assess from the boxplot whether this variable (Category) can be a good predictor of Rating in the model? Is my interpretation below correct? (y = Rating, x = Category)

View attachment 11052

The genre Events has the best ratings. Does the boxplot that sits highest indicate the best rating? The distribution of Rating across the different Category values is distinct enough to take Category as a potentially good predictor of Rating. Is my understanding correct?

Yes.

For task 8.3, your approach is wrong.
You are combining all the values of one column into a single category. Ideally, what needs to be done is the following.
For example:

dataframe column values = [coffee, milk, mango, curd, tea, cold drinks, apple, banana, cheese]

apply the replace function to milk, cheese, curd -> dairy products
apply the replace function to tea, coffee, cold drinks -> drinks
apply the replace function to apple, mango, banana -> fruits

So, from 9 initial values you now have only 3 values: dairy products, drinks and fruits.
Then convert these 3 values to dummy variables, which finally gives only 3 additional columns.

Hope it helps!
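The grouping example above could look roughly like this in pandas. The column names item and item_group are made up for the illustration; in the project the same pattern would be applied to Category, Genres and Content Rating with groupings of your own choosing:

```python
import pandas as pd

df = pd.DataFrame({"item": ["coffee", "milk", "mango", "curd", "tea",
                            "cold drinks", "apple", "banana", "cheese"]})

# One mapping covers all three groups, so no value is overwritten by a later step.
groups = {
    "milk": "dairy products", "cheese": "dairy products", "curd": "dairy products",
    "tea": "drinks", "coffee": "drinks", "cold drinks": "drinks",
    "apple": "fruits", "mango": "fruits", "banana": "fruits",
}
df["item_group"] = df["item"].replace(groups)

# Three remaining values give three dummy columns.
dummies = pd.get_dummies(df["item_group"])
print(dummies.columns.tolist())
```

Applying the whole mapping in one step avoids the situation where every row ends up with the same label.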
 

_86631

Member
Hi all,

Can someone please help me with an installation?
With python -m pip install pyforest, I am getting the following error:
ERROR: Could not find a version that satisfies the requirement pyforest (from versions: none)
ERROR: No matching distribution found for pyforest

Does it mean I don't have the correct version of pip installed?
 

Maja Nikolova

Active Member
Thanks Aayushi!


I am facing the following issue:

I separated the values into 3 categories

upload_2020-8-22_13-20-2.png

but in the dataframe all of them get the same string, PRIVATE LIFE (all the way to the last one).

upload_2020-8-22_13-20-41.png

Is it acceptable to create only one category for all of the values of the Category variable, and then get only one dummy variable?

In the end we will have 3 dummy variables anyway, one for each of the columns, as the project says: Get dummy columns for Category, Genres, and Content Rating. Or am I missing something?

Many thanks!
 

Attachments

  • upload_2020-8-22_13-19-37.png
    upload_2020-8-22_13-19-37.png
    32.7 KB · Views: 47
I am also having issues with 8.3, categorizing Genres. There are so many genres. Is there no other way of identifying the categories and encoding them? There are almost 120 unique Genres values. I could use categorisation and label encoding for Category, but for Genres it seems too cumbersome. Any other suggestions for dummy columns for Category, Genres, and Content Rating?
 
Hi,
I did it as follows; it would be great if it works for you:

# Cleaning Categories into integers

CategoryString = inp_1["Category"]
categoryVal = inp_1["Category"].unique()
categoryValCount = len(categoryVal)
category_dict = {}
for i in range(0, categoryValCount):
    category_dict[categoryVal[i]] = i
inp_1["Category"] = inp_1["Category"].map(category_dict).astype(float)

# Converting of content rating classification

RatingL = inp_1['Content Rating'].unique()
RatingDict = {}
for i in range(len(RatingL)):
    RatingDict[RatingL[i]] = i
inp_1['Content Rating'] = inp_1['Content Rating'].map(RatingDict).astype(int)

# Converting of genres to integers

GenresL = df.Genres.unique()
GenresDict = {}
for i in range(len(GenresL)):
    GenresDict[GenresL[i]] = i
inp_1['Genres'] = inp_1['Genres'].map(GenresDict).astype(int)

First do the above task, then copy the dataframe and ignore columns.

Try it at your end; mine worked very well.
 
Hey,

You have separated Category into three subsets in one piece of code. You have to run three separate pieces of code, one for each subset.
 
Hi,
try the following:
# scaling and cleaning size of installation

def change_size(size):
    if "M" in size:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    elif "k" in size:
        x = size[:-1]
        x = float(x)
        return(x)
    else:
        return None

# Apply the function and assign the result back; change dtype of ["Size"] to float

df_1["Size"] = df_1["Size"].map(change_size).astype(float)

# filling sizes which are NaN with the previous valid value
# (fillna(method='ffill') is deprecated in recent pandas versions)

df_1["Size"] = df_1["Size"].ffill()
 