Discussion in 'Big Data and Analytics' started by K Manoj, Jun 9, 2018.
*Thread locked for this batch's learners.
Please post your queries below.
Anand, will the client allow us to download Python and use it as the language for doing data science?
Anand, in today's call you mentioned that a data scientist should do ML. But ML is the capability of an AI system to understand data, derive patterns from it, and predict outcomes. So would a data scientist be involved in designing the AI systems as well? Please advise.
Anand, on the analytics maturity continuum, you mentioned descriptive, diagnostic, predictive, and prescriptive. Is cognitive analytics the next level in analytics maturity?
Thanks for the thread
Manoj, this thread seems to be different from the rest, in the sense of where we post our queries. I want help from Anand on launching Jupyter from the Anaconda prompt. He said something about conda which I couldn't catch. Thanks.
Hi Manoj, please ignore my previous post. I have been able to install Python on my system. It's easy if you follow the steps from Google.
Hi Anand, can you please share some data and the steps to do hypothesis testing? And how to build an effective model based on the samples collected from the data.
If you can also share some case studies, that would help.
The class 2 session has not been uploaded. I am able to download only the class 1 session.
Purushothaman, cognitive analytics, as you may know, tries to mimic the behavior of the human brain. The brain functions through neural activity (neurons), and the neural network branch of data science builds algorithms modeled on the functioning of neurons (our brain cells).
Techniques such as ANNs (Artificial Neural Networks) and CNNs (Convolutional Neural Networks) build algorithms that learn in much the same way our brains do. So cognitive analytics has its roots in neural networks, and both support deep learning.
Deep learning techniques help with optimization.
So under optimization you can group deep learning (neural networks, cognitive computing). Optimization can also be thought of as part of AI; the word "optimization" is generic and hence can be confusing.
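To make the neuron analogy above concrete, here is a minimal sketch of a single artificial neuron in NumPy. The inputs, weights, and bias are made-up illustrative values, not from any real model:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs plus a bias,
    passed through a sigmoid activation (loosely analogous to a
    biological neuron 'firing')."""
    z = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid squashes z into (0, 1)

# Illustrative values only
x = np.array([0.5, 0.3])    # input signals
w = np.array([0.8, -0.2])   # connection strengths ("synapse weights")
output = neuron(x, w, bias=0.1)
print(output)               # a value between 0 and 1
```

A neural network is, in essence, many such neurons stacked in layers, with the weights adjusted during training.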
Please check the Google link, which has instructions to set up Anaconda. Do let me know if you have issues.
By "client" I hope you are referring to your employer or customer.
Python is an open-source language that is now being adopted by many organizations. That said, some enterprises have software compliance teams that monitor the installation of software. Please check with them; they will suggest a way to install Python at your workplace. Hope this response helps.
I am not seeing the class 2 downloadable link. Could you please upload it?
Hi Anand, I used the above, was able to download Anaconda, set up Jupyter, and run programs in Python. Thanks.
In the last class, I felt that we jumped into tools like Anaconda and Jupyter without a complete introduction to package managers. I am not sure if it's only me or if there are others like me who do not have any background in Python. So I did some basic study of these tools and I am summarizing it here.
Anand (or anybody), please correct/amend this information if I have got something wrong.
The way I understand it, there are three package/environment managers:
PIP - This is Python's package manager, maintained by the Python Packaging Authority (PyPA), and it runs in the Python environment. PIP is a recursive acronym for "Pip Installs Packages" (it is also expanded as "preferred installer program"). PIP installs any Python package in any environment.
PIP is plagued by issues such as:
It does not perform all the dependency checks. One must read the package instructions (the requirements.txt file) to understand the dependencies and install the prerequisites; without this, a developer would face runtime errors in the program.
It is not an environment manager - this matters most to developers who may be maintaining different environments for data science, web development, etc.
It affects the system Python installation - this applies to Linux, which ships with Python in the system core. Packages installed directly affect the system Python, and any version-specific programs or packages will be affected.
Conda - This was developed by Continuum Analytics (now Anaconda, Inc.) and is a cross-platform package and environment manager. Conda installs any package within a conda environment.
The advantages of Conda over PIP:
It takes care of and installs all the dependent packages, including non-Python dependencies.
It allows installation, switching, and management of different versions of packages.
Anaconda Navigator (a GUI tool) facilitates creating and managing different environments without having to worry about the nitty-gritty of package management.
It supports packages written in Python, R, etc. It is a general package manager.
It does not affect the system Python.
It is very effective for data science projects; it brings in all the packages needed for data science and machine learning.
Anaconda - This is a full distribution of the central software in the PyData ecosystem, and includes Python itself along with binaries for several hundred third-party open-source projects. Alternatively, there is 'Miniconda', which contains only the conda package manager; conda then needs to be used to install other packages from scratch.
VirtualEnv - This is an environment manager which uses pip to manage packages and creates isolated virtual environments. It helps manage different packages and versions across virtual environments.
For hardcore developers, there is pyenv, which manages multiple Python versions and works alongside both Anaconda and VirtualEnv, allowing developers to manage their projects with either. This ecosystem also lets developers run projects on different versions of Python.
PIP can be used to install Jupyter.
Conda and PIP are independent tools: conda is not built on PIP, though pip can be used inside a conda environment to install packages that conda does not provide.
Conda itself is normally obtained via the Anaconda or Miniconda installers rather than via pip.
Anaconda Navigator manages conda environments (not virtualenv environments) under the hood.
Pip vs Conda : Differences and Comparisons.
Which Python Package Manager Should You Use?
Jupyter Notebook - a web-based interactive computational environment that lets you run live code and embed visualizations, explanatory text, and even videos in one place. Embedded visualizations reflect changes in the data as you re-run cells. Combined with rich text formatting, this makes it a good notebook that keeps your notes, your code, and its immediate output all in one place. It supports over 40 programming languages, integrates with big data tools, and can be shared via email, Dropbox, etc.
What is Jupyter Notebook?
Hi, please write to the support team; they will provide you the link. I will inform them as well.
Thank you Sashi Kiran for taking the time to share your notes and insights on package managers. My apologies that I did not cover these in detail in class; I will spend some time on this this week.
The reasons I did not cover them in detail:
1. Jupyter installation via Anaconda is very easy and a no-brainer.
2. For Python beginners, it is better to go with one IDE than to explore everything. Hence I narrowed down on Jupyter, which is both an IDE used for learning and one used by enterprises as well.
3. Jupyter is a notebook that automatically helps you learn best practices in Python (indentation, comments, function docstrings, etc.).
No problem, let's use this forum as effectively as possible. I understand that you have a plan.
In Jupyter, while typing Python code it doesn't show any suggestions.
E.g., if we type "pr" it should show some suggestions starting with "pr", like print.
Any insights on this?
Jupyter has extensive keyboard shortcuts that can be customized to help with code completion.
One code-completion feature that is automatic in Jupyter:
1. When you press the Tab key after a partial command, keyword, method, or function name, Jupyter will suggest options or complete the name. Please find attached screenshots for the below.
Below is an example: when I type pr and hit the Tab key, it shows me all the commands and keywords that start with "pr".
Here is another example: when I hit the Tab key after a ".", it gives me all the methods that can be used.
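For anyone curious, the Tab suggestions for a prefix like "pr" can be approximated programmatically. This is only a rough sketch; Jupyter's actual completer (provided by IPython) is more sophisticated and also looks at your own variables:

```python
# List built-in names starting with "pr", similar to what Tab
# completion shows after typing "pr" in a Jupyter cell.
import builtins

matches = [name for name in dir(builtins) if name.startswith("pr")]
print(matches)  # includes 'print' and 'property'
```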
Hi Anand, we have completed NumPy, and Pandas is in progress. Can you please share some case studies on how these tools help businesses with analytics and real-world problems?
I am getting an error when I try to run any command after importing NumPy. Please let me know how to resolve this.
This error occurs when you have NOT run the "import numpy as np" statement but are trying to run the statements that follow it.
Please run the import numpy as np statement first, and then rerun the line with arr = np.array(my_list).
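For reference, a minimal working sequence looks like this (my_list here is just sample data):

```python
import numpy as np   # this must run before any np.* call

my_list = [1, 2, 3, 4]     # sample data
arr = np.array(my_list)    # works now, because np is defined
print(arr)                 # [1 2 3 4]
```

Restarting the kernel clears all names, so the import must be re-run after every restart.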
NumPy and Pandas are very useful in analysing data, and thereby the business. In the life cycle of a data analytics project, NumPy and Pandas are quintessential for data acquisition, data wrangling, and data exploration.
The use cases are everywhere: any business with data can use NumPy and Pandas to analyse it.
One real-world example I worked on: a churn analysis for a leading chain of businesses in the beauty industry. They wanted me to look at their data and solve the following problems:
1. Identify gaps in their business process and articulate the problem for them.
2. Analyse customer churn.
3. Provide recommendations on how to retain clients.
4. Create a digital marketing strategy for targeted marketing and increased revenue.
As you can clearly see:
Problem 1 is a descriptive analytics problem: the client did not clearly know what the problem was and asked me to work it out.
Problem 2 is a typical problem that every client has, and they threw it into the fray. It breaks down into:
2.1 How are clients leaving? To answer this, I needed to do an RFM (Recency, Frequency, Monetary) analysis and a customer lifetime value calculation. This helped me come up with various segments of customers and their characteristics.
2.2 Why are clients leaving? This is a diagnostic analytics problem. We explored the data using statistical tools and extracted business insights, e.g. that 20% of the clients who leave do so because of a lack of contact from the company.
2.3 When are clients leaving? Predicting churn is a predictive analytics problem. This was a difficult proposition for us, because in businesses like the beauty industry there is no tracking of customer churn; we used statistical techniques like survival analysis to approximate the churn rate.
Problems 3 and 4, based off of 1 and 2, form a prescriptive analytics problem, wherein I provided recommendations for (1) customer retention, (2) data quality upkeep, (3) marketing strategy, and (4) optimizing current process efficiencies.
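As a rough illustration of the RFM step mentioned above, here is a sketch in pandas. The column names and the tiny transaction log are made up for the example, not from the actual engagement:

```python
import pandas as pd

# Hypothetical transaction log: one row per customer visit
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "days_ago":    [5, 40, 2, 10, 30, 90],   # days since each visit
    "amount":      [50, 30, 20, 25, 15, 100],
})

# RFM per customer: recency = most recent visit,
# frequency = number of visits, monetary = total spend
rfm = tx.groupby("customer_id").agg(
    recency=("days_ago", "min"),
    frequency=("days_ago", "count"),
    monetary=("amount", "sum"),
)
print(rfm)
```

In a real analysis these three scores are then binned (e.g. into quintiles) to form customer segments.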
There are several other case studies on kaggle.com, where companies post real-world business use cases and seek data scientists to solve their problems.
One such is https://www.kaggle.com/kaggle/sf-salaries.
Happy to Help
It will be great if we can also cover some solutions and examples in our further classes.
It works now. Thanks for the help.
Sure Girish. As part of this course, all of you are supposed to do a capstone project at the end. I ask you to focus all your learning in such a way that your project gets done as per the design.
Once you complete this project offered by Simplilearn, you will have good confidence in how to approach, plan, and execute a data science project.
Then you can slowly hone your skills by taking up projects from kaggle.
The exception handling and Python operators notebook is missing from the Google Drive link that you shared with us. Can you upload these files?
Also one more doubt: for logical operations in Python, can we use & and | in place of and and or?
Please check again. It's there under a specific folder called Exception Handling.
Yes Ekanth. They can be used.
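One caveat worth adding here: & and | are bitwise operators, so on plain booleans they give the same truth values as and/or, but they do not short-circuit and they bind more tightly than comparisons, so parentheses matter:

```python
a, b = True, False

# On plain booleans, & and | match and/or
print(a & b, a and b)   # False False
print(a | b, a or b)    # True True

# Precedence differs: & binds tighter than comparison operators,
# so each comparison must be parenthesized
x = 5
print((x > 1) & (x < 10))   # True
# "x > 1 & x < 10" would be parsed as x > (1 & x) < 10 instead

# Also, & and | always evaluate both sides, unlike `and`/`or`,
# which skip the right side when the left already decides the result
```

This is why pandas boolean filters require & and | with parentheses, e.g. df[(df.a > 1) & (df.b < 10)].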
Where do we post our doubts? Can we do that here?
What is the best way to read large Excel/CSV files? I am trying to read one through a Jupyter notebook, but the browser is hanging.
Actually, I missed two classes (16th June and 17th June) of SAS Advanced, which were taken by Prateek Sir.
Can you provide me the SAS code from those classes?
In the Google Drive link you have uploaded one of the files.
Is it mandatory to go through this notebook with respect to data science? I don't have experience with C++.
Hi Anand, I have a few questions related to web scraping.
1. How can we get the data from a website where the data is not in tabular format?
Example: how to get the data from Flipkart specific to products.
2. I was trying to get the data from one of the websites but was getting the below error.
data = pd.read_html('https://www.crunchbase.com/search/organization.companies')
I was trying to do web scraping from a site where the data is not in tabular format. I have read the instructions below, but I am stuck at one point: I am not able to import the library, as it gives the below message.
Also, I have seen in most web scraping projects that the requirement is to fetch the data even if it is not in tabular format. Can you share the steps to do that?
If a column in a dataset has values of different types, like integer, float, and string, how do we convert all the column's values to a single data type?
Reading a file has nothing to do with the IDE; every file can be read in every Python IDE. For reading a file you use the command
pd.read_csv('file name or path to the file').
That said, I believe the issue you are facing is NOT with reading, but with actually PRINTING or displaying the dataframe after reading the file.
When you print all the records of a large file, every IDE will hang, and so does Jupyter.
The best way to solve this problem is by asking: why do I need to display all the contents of a large file in the IDE itself?
1. For understanding the structure of the file.
2. For understanding the pattern of data in the file.
3. For seeing the number of rows and columns in the file.
This is where we all need to start using the power of pandas. For the
- 1st point: use dataframe.info() to understand the structure.
- 2nd point: use dataframe.head() or dataframe.tail() to get the top 5 and bottom 5 records; use dataframe.describe() to get the count, max, min, and quartile values of the data; use dataframe.columns to get the columns of the dataframe.
- 3rd point: use basic plots to understand the pattern.
When you have so many tools, it really beats me why one would want to see all the data in a file in the IDE console itself.
Hope my answer helps.
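To put the above into a small sketch (the file name and data here are hypothetical; a tiny CSV is generated just so the example is self-contained), note that pandas can also peek at or stream a large file instead of loading it all at once:

```python
import pandas as pd

# Stand-in for a large file: write 1000 rows of dummy data
pd.DataFrame({"a": range(1000), "b": range(1000)}).to_csv(
    "large_file.csv", index=False)

# Peek at the file without printing everything
df = pd.read_csv("large_file.csv", nrows=5)   # read only the first 5 rows
df.info()                                     # structure: columns, dtypes
print(df.head())                              # first records

# Or process the whole file in manageable chunks
total_rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=250):
    total_rows += len(chunk)
print(total_rows)  # 1000
```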
Can you be more specific about your question? A dataset's columns will have different types of data, like integer, float, and string; by virtue of these data types, the column types are already distinct, though the values may not be unique.
So can you please explain clearly whether you want unique values or a single data type across a column? Please explain with an example.
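In case the question is about coercing a mixed-type column to one type, here is a sketch. The column values are made up for the example; errors="coerce" turns any string that is not a number into NaN:

```python
import pandas as pd

# A column with mixed int, float, and string values (illustrative data)
df = pd.DataFrame({"val": [1, 2.5, "3", "oops"]})

# Coerce everything to numeric; non-numeric strings become NaN
df["val"] = pd.to_numeric(df["val"], errors="coerce")
print(df["val"].tolist())  # [1.0, 2.5, 3.0, nan]
```

Alternatively, df["val"].astype(str) converts everything to strings if that is the target type.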
Hi Girish, thanks for trying. If you had read the instructions in my notebook, you would have noted the below:
"When installing Jupyter outside the Anaconda distribution, you will need to use the following install commands.
Libraries that need to be installed with pip or conda:
conda install sqlalchemy
conda install lxml
conda install html5lib
conda install beautifulsoup4"
The above means that, in case you installed Jupyter Notebook outside Anaconda, you might have to install BeautifulSoup separately via the Anaconda prompt. The steps are below:
1. Go to Programs -> Anaconda -> Anaconda Prompt.
2. It opens a small console similar to the Windows cmd prompt.
3. Type conda install beautifulsoup4.
4. Once the conda command executes properly, come back to Jupyter and execute your read_html command.
On your other question: it's hard to believe that enterprises scrape all data from a website irrespective of whether it's tabular or non-tabular. We can all come up with thousands of reasons to do so, but let us accept that it's an effort which will need a lot of data engineering after the scraping is over. That said, here is a good link which allows you to do what you want: scrape non-tabular data from a website. I have not tried it out, and I do not intend to until I really have a need, but I presume it will satisfy your curiosity.
Look out for this specific chunk of code, and please ensure you install the packages via conda install:
from bs4 import BeautifulSoup
from urllib.request import urlopen  # urllib2 is Python 2 only

redditFile = urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
soup = BeautifulSoup(redditHtml, "html.parser")
redditAll = soup.find_all("a")
for links in soup.find_all('a'):
    print(links.get('href'))
Good try! Always start with debugging the error. A simple Google search on HTTP error 416 will lead you to the below website. This error ("Range Not Satisfiable") happens when the byte range sent in the request (by the client, i.e. your Python code) cannot be satisfied by the resource on the server.
Further exploration of how to solve this problem leads to the below link, which is an interesting read.
Hi, my apologies; I am responsible only for the Python class. Request you to kindly reach out to the Simplilearn learner help team with this request.
Ekanth, please see the above snapshot.