ML | 08 Dec - 12 Jan | Armando Galeana

Discussion in 'Big Data and Analytics' started by Nishant_Singh, Dec 8, 2018.

  1. Nishant_Singh

    Nishant_Singh Well-Known Member
    Simplilearn Support

    Joined:
    Aug 1, 2018
    Messages:
    159
    Likes Received:
    9
    Hi Learner,

    This thread has been created for you to discuss queries and concepts related to the Machine Learning Advanced Certification course.

    Happy Learning !!

    Regards,
    Team Simplilearn
     
    #1
    Armando Galeana_1 likes this.
  2. _38972

    _38972 New Member

    Joined:
    Aug 30, 2018
    Messages:
    1
    Likes Received:
    0
    Armando, I am not sure if you could load the file you created last week.
     
    #2
  3. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    Some useful links

    Data types in Python
    https://docs.python.org/3.6/library/datatypes.html

    Sample Size Calculator
    https://www.surveymonkey.com/mp/sample-size-calculator/

    Data growth forecast by Cisco
    https://www.cisco.com/c/en/us/solut...rking-index-vni/vni-hyperconnectivity-wp.html

    Data Sources (a few of them....)

    General ML

    http://archive.ics.uci.edu/ml/index.php
    https://www.reddit.com/r/datasets/
    https://www.kaggle.com/datasets
    https://opendata.socrata.com/
    https://aws.amazon.com/public-data-sets/
    https://cloud.google.com/bigquery/public-data/
    http://academictorrents.com/browse.php?cat=6

    For Deep Learning

    http://deeplearning.net/datasets/
    https://deeplearning4j.org/opendata

    For NLP

    https://github.com/niderhoff/nlp-datasets
    https://www.quora.com/Datasets-What...are-the-characteristics-biases-of-each-corpus
    https://en.wikipedia.org/wiki/Wikipedia:Database_download
    https://www.data.world/

    Datasets for Time Series Analysis

    https://www.quandl.com/
    http://data.worldbank.org/data-catalog/

    Datasets for Recommender Systems

    https://gist.github.com/entaroadun/1653794

    Datasets by Industry

    https://github.com/caesar0301/awesome-public-datasets
    https://www.data.gov/
    http://data.un.org/
    http://data.worldbank.org/

    Datasets for Streaming

    https://www.satori.com/explore

    Datasets for Web Scraping

    http://toscrape.com/

    Datasets for Current Events

    https://github.com/fivethirtyeight/data
    https://github.com/BuzzFeedNews/everything

    Other sources:

    https://blog.bigml.com/list-of-public-data-sources-fit-for-machine-learning/
    http://homepages.inf.ed.ac.uk/rbf/IAPR/researchers/MLPAGES/mldat.htm
     
    #3
    _37911 likes this.
  4. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    How Much Training Data is Required for Machine Learning?
    by Jason Brownlee on July 24, 2017

    The amount of data you need depends both on the complexity of your problem and on the complexity of your chosen algorithm.
    This is a fact, but does not help you if you are at the pointy end of a machine learning project.

    A common question I get asked is:

    How much data do I need?

    I cannot answer this question directly for you, or for anyone. But I can give you a handful of ways of thinking about this question.
    In this post, I lay out a suite of methods that you can use to think about how much training data you need to apply machine learning to your problem.

    My hope is that one or more of these methods may help you understand the difficulty of the question and how tightly it is coupled with the heart of the induction problem that you are trying to solve.

    Let’s dive into it.

    Why Are You Asking This Question?
    It is important to know why you are asking about the required size of the training dataset.

    The answer may influence your next step.

    For example:

    • Do you have too much data? Consider developing some learning curves to find out just how big a representative sample is (below). Or, consider using a big data framework in order to use all available data.
    • Do you have too little data? Consider confirming that you indeed have too little data. Consider collecting more data, or using data augmentation methods to artificially increase your sample size.
    • Have you not collected data yet? Consider collecting some data and evaluating whether it is enough. Or, if it is for a study or data collection is expensive, consider talking to a domain expert and a statistician.
    More generally, you may have more pedestrian questions such as:

    • How many records should I export from the database?
    • How many samples are required to achieve a desired level of performance?
    • How large must the training set be to achieve a sufficient estimate of model performance?
    • How much data is required to demonstrate that one model is better than another?
    • Should I use a train/test split or k-fold cross validation?
    It may be these latter questions that the suggestions in this post seek to address.

    In practice, I answer this question myself using learning curves (see below), using resampling methods on small datasets (e.g. k-fold cross validation and the bootstrap), and by adding confidence intervals to final results.
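    For example, a minimal sketch of that resampling workflow with scikit-learn (the built-in breast cancer dataset and random forest are just placeholders for your own data and model):

    # Estimate model skill on a small dataset with 10-fold cross validation
    # and report a rough 95% confidence interval around the mean score.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(n_estimators=100, random_state=42)

    scores = cross_val_score(model, X, y, cv=10)   # 10 accuracy estimates
    mean, std = scores.mean(), scores.std()
    # Rough 95% interval: mean +/- 2 standard deviations of the fold scores
    print("Accuracy: %.3f (+/- %.3f)" % (mean, 2 * std))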

    What is your reason for asking about the number of samples required for machine learning?
    Please let me know in the comments.

    So, how much data do you need?

    1. It Depends; No One Can Tell You
    No one can tell you how much data you need for your predictive modeling problem.

    It is unknowable: an intractable problem that you must discover answers to through empirical investigation.

    The amount of data required for machine learning depends on many factors, such as:

    • The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
    • The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.
    This is our starting point.

    And “it depends” is the answer that most practitioners will give you the first time you ask.

    2. Reason by Analogy
    A lot of people have worked on a lot of applied machine learning problems before you.

    Some of them have published their results.

    Perhaps you can look at studies on problems similar to yours as an estimate for the amount of data that may be required.

    Similarly, it is common to perform studies on how algorithm performance scales with dataset size. Perhaps such studies can inform you how much data you require to use a specific algorithm.

    Perhaps you can average over multiple studies.

    Search for papers on Google, Google Scholar, and Arxiv.

    3. Use Domain Expertise
    You need a sample of data from your problem that is representative of the problem you are trying to solve.

    In general, the examples must be independent and identically distributed.

    Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.

    This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

    Use your domain knowledge, or find a domain expert and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.

    4. Use a Statistical Heuristic
    There are statistical heuristic methods available that allow you to calculate a suitable sample size.

    Most of the heuristics I have seen have been for classification problems as a function of the number of classes, input features or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.

    Here are some examples you may consider:

    • Factor of the number of classes: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
    • Factor of the number of input features: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
    • Factor of the number of model parameters: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).
    They all look like ad hoc scaling factors to me.
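    Purely to illustrate the arithmetic these heuristics imply (the counts and factors below are placeholders, not recommendations):

    # Back-of-the-envelope sample sizes implied by the heuristics above.
    n_classes = 3
    n_features = 30
    n_parameters = 500          # e.g. weights in a small model

    by_class = 500 * n_classes              # x examples per class, x = 500
    by_features = int(n_features * 1.10)    # 10% more examples than features
    by_parameters = 10 * n_parameters       # 10 examples per model parameter

    print(by_class, by_features, by_parameters)   # 1500 33 5000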

    Have you used any of these heuristics?
    How did it go? Let me know in the comments.

    In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule, and the difficulty is characterized in the context of the curse of dimensionality; that is, there is an exponential increase in the difficulty of the problem as the number of input features is increased.

    For example:

    Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high dimensional problems (e.g. few samples and many input features).

    For a kinder discussion of this topic, see:

    5. Nonlinear Algorithms Need More Data
    The more powerful machine learning algorithms are often referred to as nonlinear algorithms.

    By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.

    These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

    In fact, some nonlinear algorithms like deep learning methods can continue to improve in skill as you give them more data.

    If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest, or an artificial neural network.

    6. Evaluate Dataset Size vs Model Skill
    It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.

    These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.

    I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.

    Design a study that evaluates model skill versus the size of the training dataset.

    Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.

    This graph is called a learning curve.

    From this graph, you may be able to project the amount of data that is required to develop a skillful model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.

    I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.
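    A minimal sketch of such a study using scikit-learn's learning_curve helper (the dataset and model here are placeholders for your own):

    # Plot cross-validated accuracy against training set size (a learning curve).
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    X, y = load_breast_cancer(return_X_y=True)
    sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))

    plt.plot(sizes, test_scores.mean(axis=1), marker='o')
    plt.xlabel('Training set size')
    plt.ylabel('Cross-validated accuracy')
    plt.title('Learning curve')
    plt.show()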

    7. Naive Guesstimate
    You need lots of data when applying machine learning algorithms.

    Often, you need more data than you may reasonably require in classical statistics.

    I often answer the question of how much data is required with the flippant response:

    Get and use as much data as you can.

    If pressed with the question, and with zero knowledge of the specifics of your problem, I would say something naive like:

    • You need thousands of examples.
    • No fewer than hundreds.
    • Ideally, tens or hundreds of thousands for “average” modeling problems.
    • Millions or tens-of-millions for “hard” problems like those tackled by deep learning.
    Again, this is just more ad hoc guesstimating, but it’s a starting point if you need it. So get started!

    8. Get More Data (No Matter What!?)
    Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.

    Some problems require big data, meaning all the data you have; simple statistical machine translation is one example.

    If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problems and your chosen model/s to see where that point is.

    Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.

    Don’t Procrastinate; Get Started
    Now, stop getting ready to model your problem, and model it.

    Do not let the problem of the training set size stop you from getting started on your predictive modeling problem.

    In many cases, I see this question as a reason to procrastinate.

    Get all the data you can, use what you have, and see how effective models are on your problem.

    Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.

    Further Reading
    This section provides more resources on the topic if you are looking to go deeper.

    There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.

    I expect that there are some great statistical studies on this question; here are a few I could find.

    Other related articles.

     
    #4
  5. Sreenivasan Shankaran(1920)

    Joined:
    Jun 15, 2010
    Messages:
    1
    Likes Received:
    0
    Hi Armando, can you please post the program you did last week which included all the regression examples?

    Thanks.
     
    #5
  6. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    #6
  7. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    How to Deploy Machine Learning Models




    You can also look for "End to End Machine Learning" or "Machine Learning Deployment with Django or Flask"
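    The simplest pattern is usually: train and save the model, then serve predictions from a small web endpoint. A minimal Flask sketch (the file name 'model.pkl' and the JSON shape are just illustrative assumptions):

    # Serve a previously saved scikit-learn model behind a Flask endpoint.
    import numpy as np
    import joblib
    from flask import Flask, request, jsonify

    app = Flask(__name__)
    model = joblib.load('model.pkl')   # placeholder: a model saved with joblib.dump

    @app.route('/predict', methods=['POST'])
    def predict():
        # Expects JSON like {"features": [0.1, 2.3, ...]}
        features = np.array(request.json['features']).reshape(1, -1)
        return jsonify({'prediction': model.predict(features).tolist()})

    if __name__ == '__main__':
        app.run(port=5000)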
     
    #7
  8. _50775

    _50775 Member

    Joined:
    Dec 5, 2018
    Messages:
    3
    Likes Received:
    1
    I have a question regarding the project 'Phishing detector with LR'.
    You are provided with the following resources that can be used as inputs for your model:
    1. A collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).
    When I checked the data file
    Machine_Learning_Projects\Projects\Projects for submission\Phishing detector with LR\Dataset for the project\phishing
    1. There are no labels in the data file for the 30 website parameters.
    2. Also, all 30 parameters have values of -1, 0, or 1. The URL name is not there.

    I have attached the data file.
     

    Attached Files:

    #8
  9. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    Look at the "Phishing detector with KNN" practice project.
    You will find the meaning of every feature there; however, that has no real relevance here. What you need to know is that it is the last column's value that you have to predict. The last column is your target variable, with values of -1 and 1.

    Here is how you import your data and the meaning of each attribute.

    ------------------------------------------------------------------------------

    import numpy as np

    phishing = np.loadtxt('./data/phishing.txt', delimiter=',')

    #attribute having_IP_Address { -1,1 }
    #attribute URL_Length { 1,0,-1 }
    #attribute Shortining_Service { 1,-1 }
    #attribute having_At_Symbol { 1,-1 }
    #attribute double_slash_redirecting { -1,1 }
    #attribute Prefix_Suffix { -1,1 }
    #attribute having_Sub_Domain { -1,0,1 }
    #attribute SSLfinal_State { -1,1,0 }
    #attribute Domain_registeration_length { -1,1 }
    #attribute Favicon { 1,-1 }
    #attribute port { 1,-1 }
    #attribute HTTPS_token { -1,1 }
    #attribute Request_URL { 1,-1 }
    #attribute URL_of_Anchor { -1,0,1 }
    #attribute Links_in_tags { 1,-1,0 }
    #attribute SFH { -1,1,0 }
    #attribute Submitting_to_email { -1,1 }
    #attribute Abnormal_URL { -1,1 }
    #attribute Redirect { 0,1 }
    #attribute on_mouseover { 1,-1 }
    #attribute RightClick { 1,-1 }
    #attribute popUpWidnow { 1,-1 }
    #attribute Iframe { 1,-1 }
    #attribute age_of_domain { -1,1 }
    #attribute DNSRecord { -1,1 }
    #attribute web_traffic { -1,0,1 }
    #attribute Page_Rank { -1,1 }
    #attribute Google_Index { 1,-1 }
    #attribute Links_pointing_to_page { 1,0,-1 }
    #attribute Statistical_report { -1,1 }
    #attribute Result { -1,1 }
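
    From there, a minimal sketch of the rest of the workflow (the variable names and the train/test split settings are just illustrative):

    # Split features from the target (last column) and fit a logistic regression.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X = phishing[:, :-1]   # the 30 website parameters
    y = phishing[:, -1]    # Result column, values -1 or 1

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))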
     
    #9
  10. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    #10
  11. Poonam Choudhary Arora

    Poonam Choudhary Arora Active Member
    Alumni Customer

    Joined:
    Mar 15, 2018
    Messages:
    20
    Likes Received:
    1
    Hi Armando,
    Greetings! This question is related to phishing detection through LR. While working on the project, I stumbled upon a weird behavior of the regression model: when the data is scaled with StandardScaler, the model accuracy is lower than when the data is not scaled. Is it because the entire dataset is already on the same scale of -1 to +1? I still don't understand why scaling would reduce the accuracy. Thoughts?
     
    #11
  12. _50775

    _50775 Member

    Joined:
    Dec 5, 2018
    Messages:
    3
    Likes Received:
    1
    Thanks Armando!
     
    #12
    Armando Galeana_1 likes this.
  13. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    There is no need to scale the data. Scaling helps you avoid "dominant" features and gives all features the same opportunity to "influence" the model.
    In this case, you don't have any dominant features, as all feature values are already within the same range. Scaling the data may mistakenly create "dominant" features.
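
    If you want to verify it yourself, here is a minimal sketch comparing the two on the same data (assuming the phishing array is loaded as in post #9):

    # Compare logistic regression accuracy with and without StandardScaler
    # on the already-bounded (-1/0/1) phishing features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    phishing = np.loadtxt('./data/phishing.txt', delimiter=',')
    X, y = phishing[:, :-1], phishing[:, -1]

    raw = LogisticRegression(max_iter=1000)
    scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    print("No scaling:", cross_val_score(raw, X, y, cv=5).mean())
    print("Scaled:    ", cross_val_score(scaled, X, y, cv=5).mean())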

    Hope this helps
     
    #13
  14. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    Correction: there is no need to scale the data in this project.
     
    #14
  15. Poonam Choudhary Arora

    Poonam Choudhary Arora Active Member
    Alumni Customer

    Joined:
    Mar 15, 2018
    Messages:
    20
    Likes Received:
    1
    Thanks for the quick clarification. This most certainly helps in understanding the impact of wrongly scaling an 'already standardized' dataset.
     
    #15
  16. Poonam Choudhary Arora

    Poonam Choudhary Arora Active Member
    Alumni Customer

    Joined:
    Mar 15, 2018
    Messages:
    20
    Likes Received:
    1
    Hi Armando,
    Hope you are having a great day!
    This is regarding the 2nd project, California housing prices. While working on the project, I created models with all 3 regression approaches; however, for all 3 the RMSE is pretty high (LinReg: 1.4523924547568362e+16, DTree: 75220.52998545596, RForest: 58578.731808731616). So my question is: is such a high RMSE expected and normal, or am I doing something terribly wrong? (FYI, I have followed all the steps in the instructions.) Thanks in advance for your inputs.
    Regards
    Poonam
     
    #16
  17. Poonam Choudhary Arora

    Poonam Choudhary Arora Active Member
    Alumni Customer

    Joined:
    Mar 15, 2018
    Messages:
    20
    Likes Received:
    1
    FYI, I also calculated the corresponding R2 scores; for RF it is the highest, but still only close to 0.75. Does that sound OK, or might I be missing something?
     
    #17
  18. Poonam Choudhary Arora

    Poonam Choudhary Arora Active Member
    Alumni Customer

    Joined:
    Mar 15, 2018
    Messages:
    20
    Likes Received:
    1
    Hi Armando - not sure if you got a chance to review my question yet.

    This is regarding the 2nd project, California housing prices. While working on the project, I created models with all 3 regression approaches; however, for all 3 the RMSE is pretty high (LinReg: 1.4523924547568362e+16, DTree: 75220.52998545596, RForest: 58578.731808731616). So my question is: is such a high RMSE expected and normal, or am I doing something terribly wrong? (FYI, I have followed all the steps in the instructions.)

    FYI, I also calculated the corresponding R2 scores; for RF it is the highest, but still only close to 0.75. Does that sound OK, or might I be missing something?

    As the project submission is due shortly, I would hope to get an answer soon.

    Regards
    Poonam
     
    #18
  19. Armando Galeana_1

    Customer

    Joined:
    Nov 14, 2018
    Messages:
    21
    Likes Received:
    2
    Hello Poonam, the high values are expected, as they depend on the scale (or data ranges) of your data. RMSE may not tell you as much about performance as r2. In this project, high RMSE values are expected. r2 = 0.75 is decent... you won't get r2 = 1 unless your model is overfit, so anything in the range between 0.7 and 0.999 is generally a good value for r2.
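
    For reference, a minimal sketch of how the two metrics compare (the numbers are a toy example, not results from the project):

    # RMSE is on the scale of the target (dollars here), while r2 is scale-free,
    # so a "large" RMSE can still go together with a decent r2.
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    y_test = np.array([120000., 250000., 310000., 480000.])        # toy targets
    predictions = np.array([130000., 240000., 300000., 470000.])   # toy predictions

    rmse = np.sqrt(mean_squared_error(y_test, predictions))   # same units as the target
    r2 = r2_score(y_test, predictions)                         # 1.0 is a perfect fit
    print("RMSE: %.0f  r2: %.3f" % (rmse, r2))                 # RMSE: 10000  r2: 0.994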

    Hope this helps

    Best,

    Armando
     
    #19
