
Data Science with Python | Pulkit Aneja | June 21

pulkitaneja

Active Member
Update 23 June 2021:
Code for today's session has been uploaded to the Google Drive.

Please go through the following self-learning sections:
1. Lesson 4: Python Environment Setup and Essentials
2. Lesson 5: Mathematical Computing with Python (NumPy)

Happy learning! Please post here if you have any concerns.
 

Vishal Upadhyay_3

Active Member
tuple1 = (12, 'jack', 45.5, 'new', (3, 2), 'test')
tuple1[-1:-5]
()
Why is it returning an empty tuple instead of something like this?
tuple1[1:-1]
('jack', 45.5, 'new', (3, 2))

Please help me with this.
 

pulkitaneja

Active Member
Blog for Jupyter Notebook installation in Anaconda:

If this doesn't help, try searching on Google: "install jupyter notebook using anaconda"
 

Vishal Upadhyay_3

Active Member
subjects = ['maths', 'science', 'english']
subject_numbers = ['one', 'two', 'three']

Here I wrote the same code as in the video on basic operators and functions:
total_subjects = zip(subjects, subject_numbers)
print(total_subjects)
<zip object at 0x000001F9BCFBF580>

When I do this instead:
total_subjects = list(zip(subjects, subject_numbers))
print(total_subjects)
then the output is correct.
Can you please explain, sir, why the first code isn't working?
 
new_a_2d
Out[60]:
array([[ 2,  3,  5, 11],
       [12,  8,  9,  4],
       [ 6,  7, 10, 89]])

new_a_2d[1:3,1:3]
array([[ 8,  9],
       [ 7, 10]])
For slicing the second and third rows of the array, why did we not use [1:3, 2:3] instead of [1:3, 1:3]? Since [7, 10] comes from the 3rd row of the matrix, shouldn't it be [2:3]?
 
>>> a = np.array([[1, 2], [3, 4], [5, 6]])
>>> a
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> np.compress([0, 1], a, axis=0)
array([[3, 4]])
>>> np.compress([False, True, True], a, axis=0)
array([[3, 4],
       [5, 6]])
>>> np.compress([False, True], a, axis=1)
array([[2],
       [4],
       [6]])
Sir, please explain this also.
 

pulkitaneja

Active Member
Update 25 June 2021:
Code from today's session has been uploaded to the drive.

Homework:
1. Please complete all the homework mentioned inline in the Numpy_Introduction.ipynb file.
2. Go to self-learning sections 5.10 and 5.12. Attempt those questions with the knowledge gained from today's class.

Have a great weekend and happy learning!
 
>>> a = np.array([[1, 2], [3, 4], [5, 6]])
>>> a
array([[1, 2],
       [3, 4],
       [5, 6]])
>>> np.compress([0, 1], a, axis=0)
array([[3, 4]])
>>> np.compress([False, True, True], a, axis=0)
array([[3, 4],
       [5, 6]])
>>> np.compress([False, True], a, axis=1)
array([[2],
       [4],
       [6]])
Sir, please explain this also.
For this you need to know what axis = 0 and axis = 1 mean, then you will get it!
With axis=0 the condition selects along the rows; with axis=1 it selects along the columns.
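To make the axis argument concrete, here is a small sketch of the same np.compress calls from the question, following NumPy's documented behavior:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# axis=0 filters rows: keep the rows where the condition is True
rows = np.compress([False, True, True], a, axis=0)

# axis=1 filters columns: keep only column 1
cols = np.compress([False, True], a, axis=1)

# a short condition like [0, 1] is used as far as it goes:
# row 0 is dropped (0 is falsy), row 1 is kept, row 2 is never reached
short = np.compress([0, 1], a, axis=0)
```

So [0, 1] with axis=0 keeps only the second row, which is exactly the array([[3, 4]]) output in the snippet above.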
 
subjects = ['maths', 'science', 'english']
subject_numbers = ['one', 'two', 'three']

Here I wrote the same code as in the video on basic operators and functions:
total_subjects = zip(subjects, subject_numbers)
print(total_subjects)
<zip object at 0x000001F9BCFBF580>

When I do this instead:
total_subjects = list(zip(subjects, subject_numbers))
print(total_subjects)
then the output is correct.
Can you please explain, sir, why the first code isn't working?
Wrapping the zip object in `list()`, as in your second snippet, is the right fix. Changing `subjects` and `subject_numbers` from lists to tuples alone won't change the output, because zip always returns a lazy zip object regardless of the input type.
 

pulkitaneja

Active Member
subjects = ['maths', 'science', 'english']
subject_numbers = ['one', 'two', 'three']

Here I wrote the same code as in the video on basic operators and functions:
total_subjects = zip(subjects, subject_numbers)
print(total_subjects)
<zip object at 0x000001F9BCFBF580>

When I do this instead:
total_subjects = list(zip(subjects, subject_numbers))
print(total_subjects)
then the output is correct.
Can you please explain, sir, why the first code isn't working?
Functions like zip behave the same way as functions like range: they return their own object types. These special range and zip objects handle memory more efficiently by not generating the entire list at once, but only the value needed at any point. They are created with the intent of being iterated over (using for and while loops, etc.).
Therefore, if you need to display their contents, you need to convert them using the list() function. This brings the entire object into memory, which is not the case when they are used without list().
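A minimal sketch of the two behaviors, reusing the lists from the question:

```python
subjects = ['maths', 'science', 'english']
subject_numbers = ['one', 'two', 'three']

# zip() returns a lazy iterator; printing it only shows the object, not the pairs
pairs = zip(subjects, subject_numbers)

# iterating consumes it one pair at a time, without building a full list first
consumed = [pair for pair in pairs]

# or materialize everything at once with list()
materialized = list(zip(subjects, subject_numbers))
```

Both `consumed` and `materialized` end up as [('maths', 'one'), ('science', 'two'), ('english', 'three')]; the only difference is when the pairs are produced.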
 

pulkitaneja

Active Member
tuple1 = (12, 'jack', 45.5, 'new', (3, 2), 'test')
tuple1[-1:-5]
()
Why is it returning an empty tuple instead of something like this?
tuple1[1:-1]
('jack', 45.5, 'new', (3, 2))

Please help me with this.
This is not working for the same reason that tuple1[5:0] will not work: a slice with the default step of +1 moves left to right, so the start index must come before the stop index.
>>> x = [1,2,3,4,5]
>>> x[4:0]
[]
>>> x[4:0:-1] # a negative step slices in reverse
[5, 4, 3, 2]

Use the principles shown in the above code snippet and try it out.
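Applying the same principle to the tuple from the question, a short sketch:

```python
tuple1 = (12, 'jack', 45.5, 'new', (3, 2), 'test')

# with the default step of +1, the start must lie left of the stop;
# -1 is to the right of -5, so nothing is collected
empty = tuple1[-1:-5]

# with step -1 the slice walks right to left and collects elements
reverse_part = tuple1[-1:-5:-1]
```

`reverse_part` comes out as ('test', (3, 2), 'new', 45.5): the elements at positions -1 through -4, stopping before -5.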

 

pulkitaneja

Active Member
For slicing the second and third rows of the array, why did we not use [1:3, 2:3] instead of [1:3, 1:3]?
It really depends on the question in consideration. The row part of the array is managed by the first index:
np_arr[row_index, column_index]
So for the 2nd and 3rd rows, the syntax will be [1:3, column_index]. In your suggestion you are only changing the column index, which has nothing to do with the rows.

Please ask the question again if it hasn't been answered, but try to be more articulate and provide proper examples. Write your question well so that others can also benefit from it.
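A short sketch with the array from the earlier question, showing that the first slice picks rows and the second picks columns:

```python
import numpy as np

new_a_2d = np.array([[ 2,  3,  5, 11],
                     [12,  8,  9,  4],
                     [ 6,  7, 10, 89]])

# first index = rows, second index = columns
both = new_a_2d[1:3, 1:3]     # rows 1-2 AND columns 1-2
full_rows = new_a_2d[1:3, :]  # rows 1-2, every column
```

The [7, 10] in the result is the 3rd row restricted to columns 1 and 2, not a row selected by [2:3].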
 

Vishal Upadhyay_3

Active Member
e = pd.Series((1,'vishal',2,'vishal'))
e
0         1
1    vishal
2         2
3    vishal
dtype: object
type(e)
pandas.core.series.Series
So here I tried to convert a tuple into a pd.Series and it gave the output as expected.

f = {'vishal',1,2,'akku',2}
t = pd.Series(f)
'set' type is unordered
Why do we need to sort it before we can convert it to a pd.Series?
And when I converted the sorted set to a pd.Series it showed this output:
F1 = {1,1,1,2,4,3,5}
sorted(F1)
E1 = pd.Series([F1])
0    {1, 2, 3, 4, 5}
dtype: object
Why did it take the whole set as one element?
And can we make a pd.Series directly like this?
Q1 = pd.Series({1,2,3,4,5})
It's throwing an error:
'set' type is unordered
 

Vishal Upadhyay_3

Active Member
GDP_1 = pd.Series(['Algeria','Angola','Argentina','Australia','Austria'], index=[2255.225482, 629.9553062, 11601.63022, 25306.82494, 27266.40335])
When I try to access the first element using GDP_1[0] it throws an error, and when I use 2255.225482 as the index value it gives 'Algeria'.
But when I interchange the values of the index and the data, GDP_2[0] returns 2255.225482, which is correct.
So my question is: why didn't [0] indexing work in the first case, and why did it work in the second?
 
Hi Pulkit,
When I run df = pd.dataframe() after importing pandas, it shows this error:
NameError <ipython-input-170-227ec4996766> in <module>
----> 1 df=pd.dataframe()

NameError: name 'pd' is not defined

Can you please help me solve this error?
 

Swati_6521

New Member
I downloaded the file, but read_csv is not reading the CSV file. Where should the file be placed, and how do I give the file path?
 

Vishal Upadhyay_3

Active Member
If anyone is facing an issue in accessing the CSV file because of a permission error or a Unicode error or something like that:
use an r prefix (a raw string) before the CSV file path, and for the permission error refer to this video.
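A quick sketch of what the r prefix does (the folder and file names below are made up for illustration):

```python
# in a normal string, backslash sequences like \n and \t are escape characters,
# so parts of a Windows path can silently turn into newlines and tabs
plain = "C:\new_folder\test.csv"

# the r prefix makes a raw string: every backslash is kept literally
raw = r"C:\new_folder\test.csv"
```

This is why pd.read_csv(r"C:\...\file.csv") works where the un-prefixed path can fail with a confusing Unicode or file-not-found error.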
 
I'm unable to run the program, could you please help?
imbd = pd.read_csv("C:\\Users\\ShivaTeja\\Desktop")

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-3559c9c22e99> in <module>
----> 1 imbd = pd.read_csv("C:\\Users\\ShivaTeja\\Desktop")

[... internal pandas frames in pandas/io/parsers.py and pandas/_libs/parsers.pyx ...]

FileNotFoundError: [Errno 2] File C:\Users\ShivaTeja\Desktop does not exist: 'C:\\Users\\ShivaTeja\\Desktop'
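A hedged aside on the error above: the path passed to read_csv points at the Desktop folder itself, but read_csv needs the path of the .csv file inside it. A minimal runnable sketch (the file name and contents below are invented for the demonstration):

```python
import os
import tempfile
import pandas as pd

# create a folder and a small CSV file inside it, standing in for the Desktop
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "imdb_data.csv")
with open(csv_path, "w") as f:
    f.write("title,budget\nOldboy,3000000\n")

# read_csv wants the full path ending in the file name, not the folder alone
imdb = pd.read_csv(csv_path)
```

Passing `folder` instead of `csv_path` reproduces exactly the FileNotFoundError shown in the traceback.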
 

pulkitaneja

Active Member
GDP_1 = pd.Series(['Algeria','Angola','Argentina','Australia','Austria'], index=[2255.225482, 629.9553062, 11601.63022, 25306.82494, 27266.40335])
When I try to access the first element using GDP_1[0] it throws an error, and when I use 2255.225482 as the index value it gives 'Algeria'.
But when I interchange the values of the index and the data, GDP_2[0] returns 2255.225482, which is correct.
So my question is: why didn't [0] indexing work in the first case, and why did it work in the second?
Great question. I believe after today's class you will have the answer. Square-bracket access without loc or iloc can have strange behaviors which are hard to comprehend; using loc and iloc brings clarity by distinguishing the positional index (0, 1, ...) from the labeled index.
When your index labels are numbers (case 1), 0 could mean either the first position or a GDP label. Python cannot tell which one you mean, and thus throws an error. In case 2, since the index clearly consists of country names, 0 can only refer to a position, and thus you get the answer.

In conclusion, all hail loc and iloc :)
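A short sketch of the unambiguous forms, reusing the Series from the question (truncated to three entries):

```python
import pandas as pd

gdp = pd.Series(['Algeria', 'Angola', 'Argentina'],
                index=[2255.225482, 629.9553062, 11601.63022])

first = gdp.iloc[0]              # positional: always the first element
by_label = gdp.loc[2255.225482]  # label-based: looks up the index value
```

With .iloc and .loc there is never any ambiguity about whether a number is a position or a label.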
 

pulkitaneja

Active Member
e = pd.Series((1,'vishal',2,'vishal'))
e
0         1
1    vishal
2         2
3    vishal
dtype: object
type(e)
pandas.core.series.Series
So here I tried to convert a tuple into a pd.Series and it gave the output as expected.

f = {'vishal',1,2,'akku',2}
t = pd.Series(f)
'set' type is unordered
Why do we need to sort it before we can convert it to a pd.Series?
And when I converted the sorted set to a pd.Series it showed this output:
F1 = {1,1,1,2,4,3,5}
sorted(F1)
E1 = pd.Series([F1])
0    {1, 2, 3, 4, 5}
dtype: object
Why did it take the whole set as one element?
And can we make a pd.Series directly like this?
Q1 = pd.Series({1,2,3,4,5})
It's throwing an error:
'set' type is unordered
The problem is in the part where you write [F1]. I believe you want to convert the set into a list, but this is not how you do it.
[F1] just wraps the F1 object (a set) into a list, so that F1 becomes the first and only element of that list.

What you should do instead is list(F1) to convert F1 into a list.

The reason a set is not allowed in a Series is that a set has no order, so there is no well-defined index to assign. Series objects need an index, and as a result you get an error when using a set inside a Series.
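A small sketch contrasting the two constructions, with the set from the question:

```python
import pandas as pd

f1 = {1, 1, 1, 2, 4, 3, 5}   # duplicates collapse, leaving 5 elements

# sorted() turns the set into an ordered list, which Series accepts
s = pd.Series(sorted(f1))

# [f1] wraps the whole set as ONE element, giving a length-1 Series
wrapped = pd.Series([f1])
```

`s` has five rows (1 through 5), while `wrapped` has a single row containing the entire set, exactly the "{1, 2, 3, 4, 5}" output shown above.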
 

pulkitaneja

Active Member
Update: 30 June 2021
The imdb dataset and notebook have been uploaded to the Google Drive. Use the platform of your choice for executing the notebook.

Homework:
1. Replicate all the code from today's class. Play around with different conditions using loc and iloc.
2. Find the English movies with the maximum profit percentage and the maximum loss percentage.
 

Tuhin Gayen

Active Member
Discussion on argmax and idxmax.

After today's session I tried to look into the functionalities of argmax and idxmax, as there was some confusion in the class.
I found out that argmax gives the true (positional) index, while idxmax gives the labeled index. So argmax should be used with iloc and idxmax should be used with loc.
Here is an example I created for my own understanding. Hopefully this will be helpful for others too.

Dataset used: imdb_data
Objective: Find the French movie with the highest popularity

Step 1: Import the data
imdb = pd.read_csv(r"E:\Data Science Program\Python\Lesson Datasets\imdb_data.csv")
imdb.set_index('imdb_id', inplace=True)  # Note: doing this so that we can tell the true index and the labeled index apart clearly

Step 2: Subset the French movies
French = imdb[imdb['original_language'] == 'fr']

Step 3: Find the true index of the highest popularity using argmax
French.popularity.argmax()  # Note: this gives the true index of max popularity within the 'French' subset, not the original 'imdb' data

Step 4: Find the entire row of the movie using iloc
French.iloc[French.popularity.argmax()]

Now, after step 2, we can alternatively use idxmax, which gives the labeled index.

Step 3 (alternative): Find the labeled index of the French movie with the highest popularity using idxmax
French.popularity.idxmax()  # Note: this gives the labeled index, i.e. the corresponding 'imdb_id'

Step 4 (alternative): Find the entire row of the movie using loc
French.loc[French.popularity.idxmax()]

Both work fine. Hope this helps. Please let me know if there is something else to it. :)
 

Vishal Upadhyay_3

Active Member
Why isn't the in operator working for the same data? I was trying to find 'ko' in original_language and it shows False as output even though that language is present in the column. If we can use 'in' to find a string in a column, please tell me the code for it, sir. Thank you.
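A hedged aside on this question: on a pandas Series, the in operator checks the index labels, not the values, which is the documented Series behavior. The data below is made up to mirror the column in the question:

```python
import pandas as pd

original_language = pd.Series(['ko', 'hi', 'en'])

in_index = 'ko' in original_language          # checks the index labels (0, 1, 2)
in_values = 'ko' in original_language.values  # checks the actual values
mask = original_language == 'ko'              # boolean mask, usable for filtering
```

`in_index` is False even though 'ko' is clearly a value, while `.values` membership and the comparison mask behave as one would expect.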
 

Vishal Upadhyay_3

Active Member
imdb_ko = imdb_new5[(imdb_new5['original_language']=='ko') & (imdb_new5.budget)>3300000]
Why did it give me a blank output with only the column names?
imdb_id  id  title  release_date  homepage  original_language  original_title  overview  popularity  runtime  budget  revenue

When I tried each condition separately it worked just fine.
And with
imbd_ko1 = imdb_new5[(imdb_new5['original_language']=='ko')&(imdb_new5.budget>3300000)]
What's the difference between the upper and lower code? If you can tell me the reason, it will help me resolve my problem.
 

akarsh k

Administrator
Simplilearn Support
If anyone is facing an issue in accessing the CSV file because of a permission error or a Unicode error or something like that:
use an r prefix (a raw string) before the CSV file path, and for the permission error refer to this video.
Hi, please raise a ticket; the team will assist you with the best resolution possible.
 

Tuhin Gayen

Active Member
Hi Pulkit,

So I was trying to explore datetime in a business context.

Imagine I have an e-commerce sales dataset and I want to understand which are my peak order hours. I can convert the date and time using the following:

dt = datetime.strptime("03-07-2020 5:37PM", "%d-%m-%Y %I:%M%p")
If I use dt.time() on this it gives me the time 17:37.

Now how do I group them into bins like 10:00-11:00, 11:00-12:00, 12:00-13:00, etc.?

Or can I try comparing using boolean logic? I tried it but it's giving an error.
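A hedged sketch of one way to do this, using the same strptime format from the post: the datetime's .hour attribute already identifies the one-hour window, so you can label each order with its bin and count them (the order list below is invented):

```python
from collections import Counter
from datetime import datetime

orders = ["03-07-2020 5:37PM", "03-07-2020 5:02PM", "03-07-2020 11:15AM"]

def hour_bin(ts):
    # parse the timestamp, then label it with its one-hour window
    dt = datetime.strptime(ts, "%d-%m-%Y %I:%M%p")
    return f"{dt.hour:02d}:00-{dt.hour + 1:02d}:00"

counts = Counter(hour_bin(ts) for ts in orders)
```

For a whole pandas column, the same idea applies via the .dt.hour accessor followed by a groupby or value_counts on the hour.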
 

Tuhin Gayen

Active Member
Hi Pulkit,

I found a workaround for the datetime problem with imdb_data.

We can try a user-defined function like this:

def fix_year(x):
    if x > 2020:
        new_year = x - 100
    else:
        new_year = x
    return new_year

Then, using apply, we can run the entire release_year column through this function.

Assuming the dataset was last updated in 2020, we define the if condition as x > 2020.

One problem here is that since we only have the last 2 digits of the year, there may be an overlap between a movie released in 1917 and one released in 2017. How do we solve those?
 

pulkitaneja

Active Member
Discussion on argmax and idxmax.

After today's session I tried to look into the functionalities of argmax and idxmax, as there was some confusion in the class.
I found out that argmax gives the true (positional) index, while idxmax gives the labeled index. So argmax should be used with iloc and idxmax should be used with loc.
Here is an example I created for my own understanding. Hopefully this will be helpful for others too.

Dataset used: imdb_data
Objective: Find the French movie with the highest popularity

Step 1: Import the data
imdb = pd.read_csv(r"E:\Data Science Program\Python\Lesson Datasets\imdb_data.csv")
imdb.set_index('imdb_id', inplace=True)  # Note: doing this so that we can tell the true index and the labeled index apart clearly

Step 2: Subset the French movies
French = imdb[imdb['original_language'] == 'fr']

Step 3: Find the true index of the highest popularity using argmax
French.popularity.argmax()  # Note: this gives the true index of max popularity within the 'French' subset, not the original 'imdb' data

Step 4: Find the entire row of the movie using iloc
French.iloc[French.popularity.argmax()]

Now, after step 2, we can alternatively use idxmax, which gives the labeled index.

Step 3 (alternative): Find the labeled index of the French movie with the highest popularity using idxmax
French.popularity.idxmax()  # Note: this gives the labeled index, i.e. the corresponding 'imdb_id'

Step 4 (alternative): Find the entire row of the movie using loc
French.loc[French.popularity.idxmax()]

Both work fine. Hope this helps. Please let me know if there is something else to it. :)
Wonderfully written. Clearly explained, along with the examples :)
 

pulkitaneja

Active Member
Hi Pulkit,

I found a workaround for the datetime problem with imdb_data.

We can try a user-defined function like this:

def fix_year(x):
    if x > 2020:
        new_year = x - 100
    else:
        new_year = x
    return new_year

Then, using apply, we can run the entire release_year column through this function.

Assuming the dataset was last updated in 2020, we define the if condition as x > 2020.

One problem here is that since we only have the last 2 digits of the year, there may be an overlap between a movie released in 1917 and one released in 2017. How do we solve those?
These are play-around datasets and will certainly have some flaws. The question you raised at the end is valid, but I would suggest not digging any deeper into it. Moreover, commercial movie releases didn't really happen in 1917 I believe, so it's fairly safe to go with your solution.
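Wiring the fix_year idea to apply, a minimal sketch (the year values below are made up; the 2020 cutoff is the assumption stated in the post):

```python
import pandas as pd

def fix_year(x):
    # two-digit years parsed past the assumed 2020 cutoff belong to the 1900s
    return x - 100 if x > 2020 else x

release_year = pd.Series([2068, 1999, 2015, 2042])
fixed = release_year.apply(fix_year)
```

apply runs the function on every element, so 2068 and 2042 are mapped back to 1968 and 1942 while plausible years pass through unchanged.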
 

pulkitaneja

Active Member
imdb_ko = imdb_new5[(imdb_new5['original_language']=='ko') & (imdb_new5.budget)>3300000]
Why did it give me a blank output with only the column names?
imdb_id  id  title  release_date  homepage  original_language  original_title  overview  popularity  runtime  budget  revenue

When I tried each condition separately it worked just fine.
And with
imbd_ko1 = imdb_new5[(imdb_new5['original_language']=='ko')&(imdb_new5.budget>3300000)]
What's the difference between the upper and lower code? If you can tell me the reason, it will help me resolve my problem.
The syntax for multiple conditions is df[(condition1) & (condition2)].
In your first code you closed the parenthesis before completing the condition; the 2nd code is correct, as it closes the parenthesis after the comparison with the value.

Getting a blank result without any error message simply means that the condition you applied does not match any movies, i.e., there are no Korean movies with a budget greater than 3300000. Hence you are getting 0 rows in the result.

You can verify this by looking at the highest-budget Korean movie: if its budget is less than 3300000, you know what happened :)
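A small sketch of the correct parenthesisation on made-up data (the column names mirror the question; the rows are invented):

```python
import pandas as pd

imdb_new5 = pd.DataFrame({
    'original_language': ['ko', 'ko', 'en'],
    'budget': [4000000, 1000000, 5000000],
})

# every condition is fully parenthesised BEFORE being combined with &
imdb_ko = imdb_new5[(imdb_new5['original_language'] == 'ko')
                    & (imdb_new5['budget'] > 3300000)]
```

Misplacing the parenthesis, as in the first snippet of the question, combines the boolean mask with the raw budget column via & and only then compares with 3300000, which silently yields an all-False mask instead of an error.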
 

pulkitaneja

Active Member
Update 5th July:
Code from today's class has been uploaded to the Google Drive. [imdb_assignment(2).ipynb]

Please find below the reading material for merge and concat:

Group by:

Group by (advanced reading material, very advanced!!!)

Homework:
Practice pandas on the assignment problems from the self-learning section. Please post if you are stuck on any particular question.
 

Rajat_Kumar

Administrator
Staff member
Simplilearn Support
Alumni
Update 5th July:
Code from today's class has been uploaded to the Google Drive. [imdb_assignment(2).ipynb]

Please find below the reading material for merge and concat:

Group by:

Group by (advanced reading material, very advanced!!!)

Homework:
Practice pandas on the assignment problems from the self-learning section. Please post if you are stuck on any particular question.
Thank you, Pulkit, for helping the learners!
 

Vishal Upadhyay_3

Active Member
df_student = pd.DataFrame(result1set, columns=zip(*SQL_querry.description)[0])
Here I'm trying to replicate the code from the video for pandas lesson 7.10, and I'm getting "'zip' object is not subscriptable". Why is that? When I hardcoded the columns, it worked fine.
In the video that code runs fine, so what am I doing wrong? Please tell me; I spent almost an hour checking and replicating the code before asking this question.
 

Vishal Upadhyay_3

Active Member
Sir, I uploaded the first assignment with the questions that were shown in the self-learning section, and then found out that the real assignment is in the file that we upload. Now the submit button is not working, since I already submitted the wrong assignment. What do I do now?
 

akarsh k

Administrator
Simplilearn Support
Sir, I uploaded the first assignment with the questions that were shown in the self-learning section, and then found out that the real assignment is in the file that we upload. Now the submit button is not working, since I already submitted the wrong assignment. What do I do now?
Hi, please raise a ticket; the team will enable your submit option.

Please follow the steps below to raise a "Help and support" ticket:
> Log in to your LMS account
> Select the "help" icon on the top right-hand side of the LMS page
> Select any query, for example: unlocking the certificate
> Connect to "Arya", the virtual assistant
> Select "other"
> To raise a ticket, select "yes"
 
Hi Pulkit,
I performed the groupby function on the imdb dataset: I grouped the movies by language and after that tried profit and profit percent. While doing this I found that out of 3000 rows, 812 rows have zero in the budget column. So my questions are:
1. Is it right to neglect these 812 rows during analysis? In the lecture we tried the same with the Hindi language, and in that case only 3 rows had zero, so we neglected them.

2. If I ignore these 812 rows and perform the following operation:
code: movie_anay = mov_lang_new.groupby('original_language')
movie_anay.agg(np.max)

I got the output below; is it right?
1625575151581.png
 

pulkitaneja

Active Member
Hi Pulkit,
I performed the groupby function on the imdb dataset: I grouped the movies by language and after that tried profit and profit percent. While doing this I found that out of 3000 rows, 812 rows have zero in the budget column. So my questions are:
1. Is it right to neglect these 812 rows during analysis? In the lecture we tried the same with the Hindi language, and in that case only 3 rows had zero, so we neglected them.

2. If I ignore these 812 rows and perform the following operation:
code: movie_anay = mov_lang_new.groupby('original_language')
movie_anay.agg(np.max)

I got the output below; is it right?
View attachment 17865
Hi Kanchan, thanks for this interesting take. This output cannot be correct. As I mentioned in the group by class, calling max() on a groupby object gives you the max of each column independently. max(title) simply means the title that sorts last alphabetically. The budget and revenue you see next to 'Rumble in the Bronx' do not correspond to that movie but to whichever movies had the highest budget and revenue within the group; the budget and revenue need not even come from the same movie.

So, to get the movie with the highest profit, you need to group by language and simply get the maximum profit for each language by running the following command:
imdb.groupby('original_language')['profit'].max()
The output from the above can be merged (inner join) with the original dataset (use all the columns if you like) on the language and profit columns to get the corresponding title.

Key learning: running an aggregate function on multiple columns after a groupby operation doesn't necessarily give the desired result.
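A runnable sketch of the groupby-then-merge recipe described above, on a tiny invented dataset:

```python
import pandas as pd

imdb = pd.DataFrame({
    'original_language': ['ko', 'ko', 'en'],
    'title': ['B', 'A', 'C'],
    'profit': [30, 10, 20],
})

# max profit per language, then inner join back to recover the matching titles
max_profit = imdb.groupby('original_language')['profit'].max().reset_index()
top_movies = imdb.merge(max_profit, on=['original_language', 'profit'])
```

Joining on both the language and the profit column guarantees that the title in each result row really is the movie that achieved that maximum, which a bare .agg(max) over all columns does not.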
 

pulkitaneja

Active Member
df_student = pd.DataFrame(result1set, columns=zip(*SQL_querry.description)[0])
Here I'm trying to replicate the code from the video for pandas lesson 7.10, and I'm getting "'zip' object is not subscriptable". Why is that? When I hardcoded the columns, it worked fine.
In the video that code runs fine, so what am I doing wrong? Please tell me; I spent almost an hour checking and replicating the code before asking this question.
This topic has been discussed in one of your questions above. The outputs of zip() and range() are zip and range type objects. You can subscript and index only objects of list type or a related type, such as a NumPy array, a Series, etc. You need only a simple workaround: just wrap the output of zip in the list() function.
list(zip(...))[index] should work.
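A sketch of the fix; `description` below is a hand-made stand-in for a DB-API cursor's .description attribute (a sequence of 7-item tuples whose first item is the column name). As an aside, the video's code likely ran on Python 2, where zip returned a plain list and could be subscripted directly:

```python
# stand-in for SQL_querry.description from the question
description = [
    ('student_id', None, None, None, None, None, None),
    ('name', None, None, None, None, None, None),
]

# zip(*description) transposes the tuples; list() makes it subscriptable,
# and [0] then picks out the row of column names
columns = list(zip(*description))[0]
```

`columns` comes out as the tuple of column names, suitable for the columns= argument of pd.DataFrame.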
 
Hi Pulkit,

The query I was trying to ask was as below:
While importing pyplot we can do it like
"import matplotlib.pyplot"

but when I try to import style in a similar way,
"import matplotlib.style"
it does not work;
however,
"from matplotlib import style"
works.

So why can't we import style using the dot operator?
Because, as I understand it, style is also another class within matplotlib, right?

Hopefully I have made my query clear now.
 
Can someone help with uploading the assignment?
1. Write-up: do we need to copy all the answers into the doc file and upload it?
2. Screenshot: do we need to take a screenshot of all the code and upload it?
3. Source code: what do we need to upload as the source code?
 