Discussion in 'Big Data and Analytics' started by K Manoj, Jun 5, 2018.
*Thread locked for learners in this batch.
Post your questions below.
How can we import terabytes of data into Python using Hadoop? Please demonstrate with an example.
Pritam, I appreciate the question. It is specific to data engineering rather than machine learning.
That being said, I am here to ensure your success. You can do it a few ways.
Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data.
Apache NiFi is based on technology previously called “Niagara Files” that was in development and used at scale within the NSA for the last 8 years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. Some of the use cases include, but are not limited to:
Big Data Ingest: Offers a simple, reliable and secure way to collect data streams.
IoAT Optimization: Allows organizations to overcome real world constraints such as limited or expensive bandwidth while ensuring data quality and reliability.
Compliance: Enables organizations to understand everything that happens to data in motion from its creation to its final resting place, which is particularly important for regulated industries that must retain and report on chain of custody.
Digital Security: Helps organizations collect large volumes of data from many sources and prioritize which data is brought back for analysis first, a critical capability given the time sensitivity of identifying security breaches.
Installation: Download the package, untar or unzip it, and modify conf/nifi.properties. I added the NiFi host and changed the port from 8080 to 9080. Alternatively, deploy the NiFi Ambari service by using this
NiFi UI: http://nifihost:9080/nifi/
We are going to work on 3 use cases. Part 1 focuses on a very basic use case.
1) Copy files from local filesystem into HDFS
Processor: Remember this word, because we will be working with many processors in these use cases. Drag a Processor onto the canvas, filter by "getfile" and click Add, then search for "hdfs" to add the put processor.
2) We now have GetFile and PutHDFS on the canvas. Right-click a processor to see all of its options. In this case, I am copying the data from /landing into HDFS /sourcedata. Right-click the GetFile processor to open its configuration. Set the input directory to /landing; in my case, I am setting keep source file to false.
3) Let's configure PutHDFS. Add the complete location of core-site.xml and hdfs-site.xml. You can label the processor as you like by clicking Settings, and also enable failure and success there.
4) Let's set up the relationship between Get and Put. Drag the arrow with the + sign from GetFile to PutHDFS.
Here is a step-by-step guide using pandas.
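The pandas walkthrough linked from this post is not reproduced in the thread. As a rough sketch of one common pattern for data that does not fit in memory, `pandas.read_csv` can stream a file in chunks so that only one chunk is held at a time. The column names, the tiny in-memory CSV, and the helper name `sum_column_in_chunks` are assumptions made for illustration; a real file exported from HDFS would be passed as a path instead.

```python
import io

import pandas as pd


def sum_column_in_chunks(source, column, chunksize=100_000):
    """Stream a CSV in fixed-size chunks and return (column total, row count)."""
    total = 0.0
    rows = 0
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows
    for chunk in pd.read_csv(source, chunksize=chunksize):
        total += chunk[column].sum()
        rows += len(chunk)
    return total, rows


# Tiny in-memory stand-in for a large file (hypothetical data)
demo_csv = io.StringIO("id,amount\n1,10.0\n2,20.5\n3,30.0\n4,4.5\n")
total, rows = sum_column_in_chunks(demo_csv, "amount", chunksize=2)
print(total, rows)
```

Chunking keeps memory use bounded, but it is still single-machine processing. For true terabyte-scale work a distributed engine is usually the better fit.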
What does categorization mean in ML techniques?
Thank you Eli. That was a prompt response.
It is what we covered in class: categorizing machine learning.
Here I am trying to categorize the methods in a data scientist's toolbox. Most of the algorithms can be categorized based on two approaches:
Learning Style
Similarity in Function
This post focuses on Learning Style.
Supervised Learning: Supervised learning is a type of machine learning that uses a known dataset, called the training dataset, to make predictions. The training dataset consists of input data and response values. The supervised learning algorithm builds a model from the input data and the response data. The generated model is usually validated using a test dataset. Linear and non-linear regression, classification trees, and support vector machines are a few examples of supervised learning techniques.
Unsupervised Learning: Unsupervised learning is a type of machine learning used to draw inferences from datasets that contain input data without any labelled responses. In other words, the data cannot be expressed in the form Y = F(X), where X is the input data and Y is the observed response, because no Y is observed. One of the most commonly used unsupervised learning methods is cluster analysis, which is used in exploratory data analysis to find hidden patterns or groupings in data. Most of these methods measure similarity between data points in a higher-dimensional space using metrics such as Euclidean or probabilistic distance. Hierarchical clustering and K-means clustering are a few examples of unsupervised learning techniques.
Semi-Supervised Learning: Semi-supervised learning algorithms belong to a class of algorithms that can learn from a combination of labelled and unlabelled data. Obtaining large quantities of unlabelled data is cheap compared with building labelled training data sets. These algorithms bring the strengths of both supervised and unsupervised learning into one class of algorithms. Self-training, generative models (for example mixtures of Gaussian distributions, Naive Bayes, and Hidden Markov Models), and semi-supervised SVMs are a few examples of semi-supervised learning techniques.
Reinforcement Learning: Reinforcement learning belongs to a class of machine learning algorithms capable of automatically determining the ideal behaviour in a specific context in order to maximize performance. These algorithms have a reward feedback loop, called the reinforcement signal, that helps the system learn from its decisions. Robotics is an area where these methods are used heavily. Temporal difference learning and Q-learning are examples of reinforcement learning techniques.
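To make the first two learning styles concrete, here is a minimal scikit-learn sketch on synthetic data. It is an illustration rather than course material: the two Gaussian blobs, the choice of logistic regression as the supervised model, and K-means as the unsupervised one are all assumptions picked for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs (synthetic, hypothetical data)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(100, 2)),
    rng.normal(5.0, 0.5, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

# Supervised: the labels y are used for training, and a held-out test set validates the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
clf = LogisticRegression().fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# Unsupervised: K-means sees only X and groups points by Euclidean distance
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_sizes = np.bincount(km.labels_)

print(test_accuracy, cluster_sizes)
```

The cluster ids K-means assigns are arbitrary, so they need not line up with the original 0/1 labels even when the grouping itself is the same.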
Hi, as per Eli's instruction, I was trying to understand the basics of statistics and Python by going through the online material for Data Science with Python, but I could not find the hands-on assignment to download for this module. Could you please share it?
Hi Eli,
Please help me understand how to use BigQuery in Python and how to install it, preferably in a Jupyter notebook. I ask because I am trying to work with a dataset that has 40 different tables, and the given example uses BigQuery.
I just wanted to bring to your attention that there is an error in the problem statement PDF file in the California Housing Price Prediction folder.
The problem is in step 2, Fetch data: the link is not working.
Hello all, since Simplilearn has moved to the cloud, please suggest how I can practice on Jupyter Notebook as before. I am not comfortable with the cloud environment, and I need this for practice today.
Hello Eli,
I request you to please provide us with interview questions on ML and statistics, and on data science overall.
I have one question.
Scenario: 10,000 records in total for classification, say [wine and beer],
for wine: 7,000 records
for beer: 3,000 records
I want to take a sample of 1,000 records, with 500 from each class (equal), for ML.
How can I do it in Python with sklearn's train_test_split?
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)  # it preserves the 70:30 class ratio, but I want 50:50
Not sure I understand your question.
The traditional and most common split is 70-30 or 75-25. If you have 10k or 30k samples, a 70-30 split is fine. But when dealing with big data, for example 1 million samples, there is no need to hold out 300k samples as test data, so in that case 90-10 is perfectly okay, because 100k test samples already give a good indication of how the model performs.
In brief: for smaller datasets, go with the recommended 70-30 split; for much larger datasets, choose the split by the number of test samples you actually need.
If you have enough data, you can go for a 50-50 split, but no single ratio is best; it depends entirely on the amount of data you have and the complexity of the task you are trying to perform. If you train on enough data, the size of the test set is of little concern. The whole reason for splits is that we usually have limited, finite data and want to make the best use of it by training on as much as we can. So go by the heuristic and do a 75-25 split. But don't forget to cross-validate on the training set; I would recommend stratified K-fold. If your performance metric is suffering, the split is the last place to look for the cause.
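For the wine and beer question above, `stratify=y` keeps the original class ratio, so it cannot produce a 50:50 sample by itself. One possible approach, sketched here as an assumption and not as the instructor's method, is to draw 500 rows per class first and only then apply the 75-25 split and the stratified K-fold suggested in this reply. The data frame and its `feature` and `label` columns are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical stand-in for the 10,000-record data set: 7,000 wine, 3,000 beer
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "label": ["wine"] * 7_000 + ["beer"] * 3_000,
})

# Draw 500 rows from each class to get a balanced 1,000-row sample
balanced = pd.concat(
    [group.sample(n=500, random_state=42) for _, group in df.groupby("label")]
)

X = balanced[["feature"]]
y = balanced["label"]

# 75-25 split; stratify=y now preserves the 50:50 ratio of the balanced sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Stratified 5-fold cross-validation on the training set only
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_counts = [
    y_train.iloc[val_idx].value_counts().to_dict()
    for _, val_idx in skf.split(X_train, y_train)
]
print(fold_counts)
```

Downsampling the majority class discards data, so it is worth comparing against training on the full set with class weights before settling on it.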
Probability and Statistics Interview Questions
Explain what regularization is and why it is useful.
How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression?
Explain what precision and recall are. How do they relate to the ROC curve?
How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything?
What is root cause analysis?
Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.
What is statistical power?
Explain what resampling methods are and why they are useful. Also explain their limitations.
Is it better to have too many false positives, or too many false negatives? Explain.
What is selection bias, why is it important and how can you avoid it?
Imagine a test with a true positive rate of 100% and false positive rate of 5%. Imagine a population with a 1/1000 rate of having the condition the test identifies. Given a positive test, what is the probability of having that condition?
What is the normal distribution? Give an example of some variable that follows this distribution.
What about log-normal?
Explain what a long tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and prediction problems?
How to check if a distribution is close to Normal? Why would you want to check it? What is a QQ Plot?
Give examples of data that does not have a Gaussian distribution, or log-normal.
Do you know what the exponential family is?
Do you know the Dirichlet distribution? The multinomial distribution?
What is the Law of Large Numbers? The Central Limit Theorem?
Why are they important for statistics?
What summary statistics do you know?
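The diagnostic-test question in the list above (100% true positive rate, 5% false positive rate, 1/1000 prevalence) is a direct application of Bayes' rule: P(condition | positive) = P(positive | condition) P(condition) / P(positive). A minimal sketch of the arithmetic follows; the function and parameter names are illustrative.

```python
def posterior_given_positive(prevalence, true_positive_rate, false_positive_rate):
    """Return P(condition | positive test) using Bayes' rule."""
    # Total probability of a positive test: true positives plus false positives
    p_positive = (
        true_positive_rate * prevalence
        + false_positive_rate * (1.0 - prevalence)
    )
    return true_positive_rate * prevalence / p_positive


p = posterior_given_positive(
    prevalence=1 / 1000, true_positive_rate=1.0, false_positive_rate=0.05
)
print(f"{p:.4f}")  # roughly 0.02, i.e. about 2%
```

The result is low because false positives from the 999 healthy people in every 1,000 far outnumber the single true positive.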
Data Modeling Interview Questions
What are the most important skills for a data scientist to have?
What types of data are important for business needs?
What data would you go after and start working on?
What are the assumptions required for linear regression?
When you get a new data set, what do you do with it to see if it will suit your needs for a given project?
How do you handle big data sets?
How do you detect outliers?
How do you control model complexity?
How do you model a quantity you can’t observe?
You have one model and want to find the best set of parameters for this model. How would you do that?
How would you look for the best parameters? Do you know something else apart from grid search?
What is Cross-Validation?
What is 10-Fold CV?
What is the difference between holding out a validation set and doing 10-Fold CV?
How do you know if your model overfits?
How do you assess the results of a logistic regression?
Which evaluation metrics do you know? Anything apart from accuracy?
Which is better: Too many false positives or too many false negatives?
What are precision and recall?
What is a ROC curve? What is AU ROC (AUC)? How to interpret the curve and AU ROC?
Do you know about Concordance or Lift?
Data Science Process Interview Questions
How would you create a taxonomy to identify key customer trends in unstructured data?
Python or R — Which one would you prefer for text analytics?
Which technique is used to predict categorical responses?
What is logistic regression? Or State an example when you have used logistic regression recently.
What are Recommender Systems?
Why does data cleaning play a vital role in analysis?
Differentiate between univariate, bivariate and multivariate analysis.
What do you understand by the term Normal Distribution?
What is Linear Regression?
What is Interpolation and Extrapolation?
What is power analysis?
What is K-means? How can you select K for K-means?
What is Collaborative filtering?
What is the difference between Cluster and Systematic Sampling?
Are expected value and mean value different?
What does P-value signify about the statistical data?
Do gradient descent methods always converge to same point?
What are categorical variables?
How can you make data normal using the Box-Cox transformation?
What is the difference between Supervised Learning and Unsupervised Learning?
Explain the use of Combinatorics in data science.
Why is vectorization considered a powerful method for optimizing numerical code?
What is the goal of A/B Testing?
What is an Eigenvalue and Eigenvector?
What is Singular Value Decomposition?
What is Gradient Descent?
How can outlier values be treated?
How can you assess a good logistic model?
How can you iterate over a list and also retrieve element indices at the same time?
During analysis, how do you treat missing values?
Explain the Box-Cox transformation in regression models.
Can you use machine learning for time series analysis?
Write a function that takes in two sorted lists and outputs a sorted list that is their union.
What is the difference between Bayesian Inference and Maximum Likelihood Estimation (MLE)?
What is Regularization and what kind of problems does regularization solve?
What is multicollinearity and how can you overcome it?
What is the curse of dimensionality?
How do you decide whether your linear regression model fits the data?
What is the difference between squared error and absolute error?
What is Machine Learning?
How are confidence intervals constructed and how will you interpret them?
How will you explain logistic regression to an economist, physician scientist and biologist?
How can you overcome Overfitting?
Differentiate between wide and tall data formats.
Is Naïve Bayes bad? If yes, in what respects?
How would you develop a model to identify plagiarism?
Can you outline the steps in an analytics project?
Have you heard of CRISP-DM (Cross Industry Standard Process for Data Mining)?
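Two of the coding questions in the list above have short Python answers. The sketch below shows one way to approach them: a two-pointer merge for the union of two sorted lists, and the built-in `enumerate` for iterating with indices. It assumes "union" means duplicates are dropped, which the question does not state, and `sorted_union` is a name chosen for illustration.

```python
def sorted_union(a, b):
    """Merge two ascending lists into one ascending list without duplicates."""
    result = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take from a when b is exhausted or a's current element is not larger
        if j >= len(b) or (i < len(a) and a[i] <= b[j]):
            value = a[i]
            i += 1
        else:
            value = b[j]
            j += 1
        # Skip values already emitted (assumption: union means no duplicates)
        if not result or result[-1] != value:
            result.append(value)
    return result


merged = sorted_union([1, 3, 5, 5], [2, 3, 6])

# enumerate yields (index, element) pairs while iterating
for index, value in enumerate(merged):
    print(index, value)
```

The merge runs in time linear in the combined length of the inputs, which is the point of the question: re-sorting the concatenated lists would ignore that they are already ordered.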
Data Science Machine Learning Interview Questions
What is your favorite ML algorithm and why?
Describe the regression problem. Is it supervised learning? Why?
What is linear regression? Why is it called linear?
Discuss the bias-variance tradeoff.
What is Ordinary Least Squares Regression? How can it be learned?
Can you derive the OLS Regression formula? (For one-step solution)
Do we always need the intercept term? When do we need it and when do we not?
What is collinearity and what to do with it? How to remove multicollinearity?
What if the design matrix is not full rank?
What is overfitting a regression model? What are ways to avoid it?
What is Ridge Regression? How is it different from OLS Regression? Why do we need it?
What is Lasso regression? How is it different from OLS and Ridge?
What are the assumptions required for linear regression?
You would like to find significant features. How would you do that?
You fit a multiple regression to examine the effect of a particular feature. The feature comes back insignificant, but you believe it is significant. Why can it happen?
How do you check whether the regression model fits the data well?
Can you describe the classification problem?
What is the simplest classification algorithm?
What classification algorithms do you know? Which one do you like the most?
What is a decision tree?
What are some business reasons you might want to use a decision tree model?
How do you build it?
What impurity measures do you know?
Describe some of the different splitting rules used by different decision tree algorithms.
Is a big bushy tree always good? Why would you want to prune it?
Is it a good idea to combine multiple trees?
What is Random Forest? Why is it good?
What is logistic regression?
How do we train a logistic regression model?
How do we interpret its coefficients?
What is an Artificial Neural Network?
How to train an ANN? What is back propagation?
How does a neural network with three layers (one input layer, one inner layer and one output layer) compare to a logistic regression?
What is deep learning? What is CNN (Convolution Neural Network) or RNN (Recurrent Neural Network)?
What is Regularization?
Which problem does Regularization try to solve?
What does it mean (practically) for a design matrix to be “ill-conditioned”?
When might you want to use ridge regression instead of traditional linear regression?
What is the difference between the L1 and L2 regularization?
Why (geometrically) does LASSO produce solutions with zero-valued coefficients (as opposed to ridge)?
What is the purpose of dimensionality reduction and why do we need it?
Are dimensionality reduction techniques supervised or not? Are all of them (un)supervised?
What ways of reducing dimensionality do you know?
Is feature selection a dimensionality reduction technique?
What is the difference between feature selection and feature extraction?
Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?
What is Principal Component Analysis (PCA)? What is the problem it solves? How is it related to eigenvalue decomposition (EVD)?
What’s the relationship between PCA and SVD? When is SVD better than EVD for PCA?
Under what conditions is PCA effective?
Why do we need to center data for PCA and what can happen if we don’t do it? Do we need to scale data for PCA?
Is PCA a linear model or not? Why?
Do you know other Dimensionality Reduction techniques?
What is Independent Component Analysis (ICA)? What’s the difference between ICA and PCA?
Suppose you have a very sparse matrix where rows are highly dimensional. You project these rows on a random vector of relatively small dimensionality. Is it a valid dimensionality reduction technique or not?
Have you heard of Kernel PCA or other non-linear dimensionality reduction techniques? What about LLE (Locally Linear Embedding) or t-SNE (t-distributed Stochastic Neighbor Embedding)?
What is Fisher Discriminant Analysis? How is it different from PCA? Is it supervised or not?
What is the difference between a convex function and non-convex?
What is Gradient Descent Method?
Will Gradient Descent methods always converge to the same point?
What is a local optimum?
Is it always bad to have local optima?
What is Newton’s method?
What kind of problems are well suited for Newton’s method? BFGS? SGD?
What are “slack variables”?
Describe a constrained optimization problem and how you would tackle it.
What is NLP? How is it related to Machine Learning?
How would you turn unstructured text data into structured data usable for ML models?
What is the Vector Space Model?
What is TF-IDF?
Which distances and similarity measures can we use to compare documents? What is cosine similarity?
Why do we remove stop words? When do we not remove them?
Language Models. What are N-grams?
What is Curse of Dimensionality? How does it affect distance and similarity measures?
What are the problems of large feature space? How does it affect different models, e.g. OLS? What about computational complexity?
What dimensionality reductions can be used for preprocessing the data?
What is the difference between density-sparse data and dimensionally-sparse data?
Data Science Culture Fit Interview Questions
Which is your favorite machine learning algorithm and why?
In which libraries for Data Science in Python and R, does your strength lie?
What kind of data is important for specific business requirements and how, as a data scientist will you go about collecting that data?
Tell us about the biggest data set you have processed till date and for what kind of analysis.
Which data scientists do you admire the most and why?
Suppose you are given a data set, what will you do with it to find out if it suits the business needs of your project or not.
What were the business outcomes or decisions for the projects you worked on?
What unique skills do you think you can add to our data science team?
Which are your favorite data science startups?
Why do you want to pursue a career in data science?
What have you done to upgrade your skills in analytics?
What has been the most useful business insight or development you have found?
How will you explain an A/B test to an engineer who does not know statistics?
When does parallelism help your algorithms run faster and when does it make them run slower?
How can you ensure that you don’t analyse something that ends up producing meaningless results?
How would you explain to the senior management in your organization as to why a particular data set is important?
Is more data always better?
What are your favorite imputation techniques to handle missing data?
What are your favorite data visualization tools?
During one of my client discussions, I came across association rules being implemented across the retail segment to visualize sales trends.
I am curious whether we are planning to delve into any of the following topics:
1) Association Rules of Shopping trend samplers
2) Market basket analysis
3) Clustering - Hierarchical Clustering
In the meantime, any useful links you have come across would be helpful.
Thanks for your inspiring support,