
Natural Language Processing | Wajahat | Jan 30 - Feb 21 | 2021


Here is the hashtag assignment with one value:

Hi Wajahat,
I have reworked the hashtag assignment; the new version is attached.


  • Hashtags_new.txt
    17.7 KB · Views: 9


Hi Wajahat,

Attached is the practice for word analogy using gensim.

Also attached is the news practice using gensim.


  • Word Analogy Practice.txt
    3.8 KB · Views: 7
  • Session2_assignment_News.txt
    23.3 KB · Views: 5


Hi Sir,
Kindly find my script attached for assignment-1: Tweets Cleanup and Analysis Using Regular Expressions.
Kindly let me know if it is fine as it is or if I should modify it further.



  • Untitled1 (2).JPG
    117.9 KB · Views: 20


Hi Sir,
Please check the Word Analogy assignment attached.



  • Asssignment_2_WordAnalogy.txt
    3.9 KB · Views: 3

Jude Rodriguez

Active Member
I have tried a different approach to the newsgroups problem, but I have run into an issue applying a regex to capture multiple lines separated by '\n', each followed by a sentence, in a repeating pattern. I went through the regex reference on the w3 website, but no luck. Can anyone help, please? Any hint would be greatly appreciated, thanks.


  • Assignment 2 - 31-JAN-21.pdf
    344.7 KB · Views: 5
  • Screenshot_145.png
    31.8 KB · Views: 14
  • Screenshot_146.png
    52.6 KB · Views: 10

Jude Rodriguez

Active Member
I found a regex call that works:
print(re.findall(r'\n\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*', train[0][0]))
It successfully extracts the contents of the first cell, but I am not sure how to generalize it: each cell contains a different number of '\n' characters, so with a fixed pattern the full content may not be extracted, only part of it.


  • Screenshot_147.png
    37.2 KB · Views: 14
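One way to generalize, instead of chaining a fixed number of `\n.*` groups, is to let `.` match newlines with `re.DOTALL`, so a single pattern captures however many lines follow the first blank line. A minimal sketch with a toy stand-in for `train[0][0]`:

```python
import re

# Toy stand-in for train[0][0]; real cells have varying numbers of '\n'
cell = "From: someone\nSubject: test\n\nbody line 1\nbody line 2\nbody line 3"

# re.DOTALL lets '.' match '\n', so one pattern covers any number of lines
body = re.findall(r'\n\n(.*)', cell, flags=re.DOTALL)[0]
print(body)  # everything after the first blank line
```

If the cells instead contain several blank-line-separated blocks, `re.split(r'\n\s*\n', cell)` splits them without hard-coding a count.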
Using word2vec to identify intent in the chatbot solution:

Load the library and download (then extract) https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

from gensim.models import KeyedVectors

Add this to the Session class:

self.model = KeyedVectors.load_word2vec_format('Datafiles/GoogleNews-vectors-negative300.bin', binary=True)

Modify this line to pass the model to the function:

self.current_intent = intentIdentifier(self.model, user_input, self.context, self.current_intent)

Then the intentIdentifier function becomes:

def intentIdentifier(model, user_input, context, current_intent):
    # If the current intent is not None, stick with the ongoing intent
    if current_intent is not None:
        return current_intent
    user_input = user_input.lower()
    # wmdistance expects lists of tokens; a lower distance means a closer match
    scores = [('OrderBook', model.wmdistance(user_input.split(), "buy mobile phone".split())),
              ('StoreSearch', model.wmdistance(user_input.split(), "search for restaurant".split()))]
    scores = sorted(scores, key=lambda tup: tup[1])  # sort by distance, not by intent name
    return loadIntent('params/newparams.cfg', scores[0][0])
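A side note on that sort: since wmdistance is a distance (lower means closer), the tuples must be ordered by their second element; sorting by the first element just orders the intents alphabetically. A tiny illustration with made-up scores:

```python
# Hypothetical (intent, distance) pairs; lower distance = closer match
scores = [('OrderBook', 2.47), ('StoreSearch', 1.12)]

best = sorted(scores, key=lambda tup: tup[1])  # sort by distance, not by name
print(best[0][0])  # StoreSearch, the closest intent
```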
Project 2 assignment complete:

Logistic Regression on the imbalanced dataset with the word2vec vectorizer:
Accuracy Score: 0.9457218833098702
Confusion Matrix:
[[5856 81]
[ 266 190]]
Area Under Curve: 0.7015117062489473
Recall score: 0.4166666666666667

Logistic Regression on the imbalanced dataset with the word2vec vectorizer and weight adjustment:
Accuracy Score: 0.874393868293446
Confusion Matrix:
[[5211 726]
[ 77 379]]
Area Under Curve: 0.8544281845340993
Recall score: 0.831140350877193

Logistic Regression after GridSearch with the best parameters (only marginal improvement):
Best score: 0.9257436627641153 with param: {'C': 0.09, 'class_weight': {0: 0.07,  1: 0.93}, 'penalty': 'l2'}
Accuracy Score: 0.874863131550133
Confusion Matrix:
[[5210 727]
[ 73 383]]
Area Under Curve: 0.8587299318280542
Recall score: 0.8399122807017544

Stratified KFold results on imbalanced / SMOTE balanced dataset:
Imbalanced: 1 of KFold 4
recall score: 0.82174688057041
Imbalanced: 2 of KFold 4
recall score: 0.8253119429590018
Imbalanced: 3 of KFold 4
recall score: 0.8107142857142857
Imbalanced: 4 of KFold 4
recall score: 0.8160714285714286
SMOTE: 1 of KFold 4
recall score: 0.9958277254374159
SMOTE: 2 of KFold 4
recall score: 0.9965006729475101
SMOTE: 3 of KFold 4
recall score: 0.9946164199192463
SMOTE: 4 of KFold 4
recall score: 0.9960969044414536
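The near-perfect SMOTE recall above is worth a caution: if oversampling happens before the train/test split, copies of the same minority sample can land on both sides of the split, inflating the score. A stdlib-only sketch of the leak (simple duplication stands in for SMOTE's synthetic samples):

```python
import random

random.seed(0)
minority = [('m', i) for i in range(20)]    # 20 minority samples
majority = [('M', i) for i in range(200)]   # 200 majority samples

# Naive oversampling BEFORE splitting: duplicate minority rows to balance classes
oversampled = majority + minority * 10
random.shuffle(oversampled)

split = int(0.8 * len(oversampled))
train, test = oversampled[:split], oversampled[split:]

# Identical minority rows now appear in both train and test -> leakage
leaked = set(r for r in train if r[0] == 'm') & set(r for r in test if r[0] == 'm')
print(len(leaked))  # > 0: the test set is contaminated
```

The usual remedy is to split (or fold) first and apply SMOTE only to each training fold.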

Finally, the results improve markedly with a neural network. Here is the MLPClassifier on the SMOTE-balanced dataset:
Accuracy Score: 0.9836810228802153
Confusion Matrix:
[[5716 176]
[ 18 5978]]
Area Under Curve: 0.9835634935623523
Recall score: 0.9969979986657772

I ran the GridSearch against the entire dataset, not only the training set, and with roc_auc scoring. After splitting the dataset back into training and testing sets, the actual roc_auc score dropped. If this is expected and the right way to do it, then I am ready to submit the project for grading.
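That drop is a symptom of the search having already seen the test rows. The standard pattern is to split first and let GridSearchCV cross-validate inside the training portion only; a sketch on synthetic data (not the project dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data standing in for the project dataset
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Split FIRST; the search only ever sees the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.09, 1.0], 'class_weight': [None, {0: 0.07, 1: 0.93}]},
                    scoring='roc_auc', cv=3)
grid.fit(X_tr, y_tr)

# The unbiased estimate comes from data the search never touched
test_auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
print(test_auc)
```

With this setup the held-out roc_auc is the number to report; it being lower than the cross-validation score is normal.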
Update: Topics from NEWS Items assignment


  • Topics_from_NEWS_Items-code_sheet_1.JPG
    128.9 KB · Views: 13
  • Topics_from_NEWS_Items-code_sheet_2.JPG
    115.3 KB · Views: 8
  • Topics_from_NEWS_Items-code_sheet_3.JPG
    118.2 KB · Views: 9
I tried various options to access well-known news websites, including Reuters, but access was blocked, so I had to use an Indonesian soccer site, and the queries had to be made in Indonesian. Apologies.


  • NEWS_Engine_Assignment_Code_Pg_1.JPG
    105.1 KB · Views: 9
  • NEWS_Engine_Assignment_Code_Pg_3.JPG
    130.5 KB · Views: 7
  • NEWS_Engine_Assignment_Code_Pg_2.JPG
    238.1 KB · Views: 8
  • NEWS_Engine_Assignment_Code_Pg_4.JPG
    106.1 KB · Views: 10
For fun: web deployment of the AIML chatbot:

1. Make a folder on your web server, place the usual files there, and add one aimlpy.py file as below:

import aiml
import asyncio
from autocorrect import Speller
from aiohttp import web
import aiohttp_cors

async def start_model(app):
    model = 'aiml_pretrained_model.dump'  # saved brain; alternatively load it with k.loadBrain(model)
    k = aiml.Kernel()
    k.bootstrap(learnFiles='learningFileList.aiml', commands='load aiml')
    app['k'] = k

async def dispose_model(app):
    del app['k']  # the aiml Kernel has nothing async to close

async def index(request):
    query = request.rel_url.query["query"]
    check = Speller(lang='en')
    query = [check(w) for w in query.split()]
    question = " ".join(query)
    response = request.app['k'].respond(question)
    return web.Response(text=response, headers={
            "X-Custom-Server-Header": "Custom data"})

async def my_web_app():
    app = web.Application()
    app.on_startup.append(start_model)    # build the kernel when the server starts
    app.on_cleanup.append(dispose_model)
    cors = aiohttp_cors.setup(app)
    resource = cors.add(app.router.add_resource("/"))
    route = cors.add(
        resource.add_route("GET", index), {
            "https://your.frontend.site.address": aiohttp_cors.ResourceOptions(
                allow_headers=("X-Requested-With", "Content-Type"))
        })
    return app

2. make sure latest gunicorn is installed and run the website like this from terminal

gunicorn --certfile=/path/to/ssl/certs/combined aimlpy:my_web_app --bind your.aiml.webapp.address:8020 --worker-class aiohttp.GunicornWebWorker

Here the combined file contains the SSL certificate, key, and intermediate certificate. Use any available port (8020 in the example) and open the firewall to allow access.

3. Make a webpage on https://your.frontend.site.address with a simple form containing a text/paragraph field and a submit button. Bind an AJAX call to the submit button like this:

<div id="ajaxcontact-text">
        <div id="ajaxcontact-response" style="background-color:#E6E6FA;color:blue;"></div>
        <textarea id="ajaxcontactcontents" name="ajaxcontactcontents" rows="10" cols="20"></textarea><br>
        <a onclick="ajaxformsend(ajaxcontactcontents.value);" style="cursor: pointer"><b>Ask me</b></a>
</div>

function ajaxformsend(contents) {
    $.ajax({
        method: 'GET',
        url: 'https://your.aiml.webapp.address:8020',
        data: {
            query: contents
        },
        success: function (data, textStatus, XMLHttpRequest) {
            var id = '#ajaxcontact-response';
            $(id).html(data);  // show the bot's reply
        },
        error: function (XMLHttpRequest, textStatus, errorThrown) {
            alert(errorThrown);
        }
    });
}

4. Invite your coworkers to have a little fun.
I have a problem specific to my industry. I would like to augment the AIML pretrained model we used in class so that it can answer questions about SAP.

I can provide a structured SAP glossary with 50,000+ entries such as the one depicted here: https://sap.wikitip.info/

I need some guidance on how to retain the existing understanding of question structure while detecting intent about SAP (I can instruct users to always include the word SAP in the query) and then dig into the SAP corpus provided by the glossary.
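One simple way to keep the existing AIML behaviour while adding SAP answers is to gate on the "SAP" keyword first and fall back to the kernel otherwise. A sketch with a hypothetical two-entry glossary (the real one would be loaded from the 50,000+ entry dump; `kernel_respond` stands in for `k.respond`):

```python
# Hypothetical mini-glossary; the real one would come from the structured dump
sap_glossary = {
    'bapi': 'Business Application Programming Interface, a standardized SAP API.',
    'idoc': "Intermediate Document, SAP's standard data exchange format.",
}

def respond(query, kernel_respond=lambda q: 'AIML answer'):
    tokens = query.lower().split()
    if 'sap' in tokens:                   # users are instructed to include "SAP"
        for word in tokens:
            if word in sap_glossary:      # dig into the glossary corpus
                return sap_glossary[word]
        return 'Which SAP term do you mean?'
    return kernel_respond(query)          # unchanged AIML behaviour otherwise

print(respond('what is an IDoc in SAP'))
```

For fuzzier matching, the glossary terms could be scored against the query with word2vec similarity instead of exact lookup.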
A nice piece of code to visualise keywords in topic modeling:

import pyLDAvis.gensim
import pyLDAvis

# Visualize the topics (model: trained LDA model, bow_corpus: BoW corpus, diction: gensim Dictionary)
LDAvis_prepared = pyLDAvis.gensim.prepare(model, bow_corpus, diction)
pyLDAvis.display(LDAvis_prepared)
Sir, this is the final project. I have a doubt about naming the topics; can you please give an example of how to name them?


  • Topic_modeling.ipynb - Colaboratory.pdf
    238.8 KB · Views: 17


Hi Sir,
Kindly find my notebook attached. I have extracted the top 12 topics, but I am not able to understand what they represent, why they come out this way, or why 'NN' appears as a topic. Please guide me on where I am making mistakes.

Thanks & Regards,


  • SL_NLP_Project_Topic_Analysis_of_Review_Data.zip
    995.4 KB · Views: 12
Sir, I am unable to understand Project 1 subtopic 9 (which says "Analyze the topics through the business lens"). I understand the mechanics of what is to be done, but not this part; kindly help me out with it.
Script for the Summarization of News assignment - class 7, 20 Feb


  • summarization of news - Problem statement.JPG
    115.4 KB · Views: 5
  • summarization of news - script.JPG
    42.1 KB · Views: 8

Jude Rodriguez

Active Member
In the Twitter analysis project I created a calculated list in a dataframe, which became a nested list. How can it be flattened using a lambda or a list comprehension, please?


  • Screenshot_148.png
    90.5 KB · Views: 8
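A nested list like that can be flattened with a double list comprehension, assuming one level of nesting:

```python
nested = [[1, 2], [3], [4, 5, 6]]   # e.g. a dataframe column of lists

# One loop per nesting level: outer over sublists, inner over their items
flat = [item for sublist in nested for item in sublist]
print(flat)  # [1, 2, 3, 4, 5, 6]
```

If the column lives in a pandas DataFrame, `df['col'].explode()` (pandas 0.25+) does the same row-wise.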
Please guide me on how to start with the following task.
Note: the original file is actually in xlsx format. I don't know why I'm not able to attach an Excel file, so I just copied the text snippets from the Excel file and pasted them into a text document.


Develop a simple sentiment analyzer

Part 1

The task is to create code that is able to read a text snippet and identify the sentiment of the input text snippet based on a dictionary.

This dictionary needs to store two sets of words: words indicating positive sentiment (e.g., good, great, happy) and words indicating negative sentiment (e.g., bad, angry, unhappy, sad). You will need to build this dictionary and take care that simple transformations of words, such as case changes or different noun or verb forms, do not affect the functioning of your code. You may also extend the dictionary to include phrases or terms containing multiple words to make it more general.

In case the input text snippet includes both negative and positive words, you may count the number of positive and negative terms and give the snippet a sentiment score that ranges from -1 to 1. An example of such a function is given next:

Sentiment score = (number of positive words − number of negative words) / max{number of positive words, number of negative words}

A score close to 0 indicates a neutral statement, whereas a score close to 1 or -1 indicates a positive or negative sentiment respectively.

You may use the sample text snippets (attached) and the dictionary links provided with this mail, and extend them to make them more general. Feel free to think of other enhancements that will improve the sentiment analyser, and implement them. Remember, the idea is to be able to capture sentiments accurately for customer feedback or customer service comments.

Part 2

Extend the code created in part 1 to read multiple text snippets (number may vary from 1 to a few thousand) and identify sentiments for each of those snippets based on the stored dictionary.
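To get started, the scoring function above can be sketched directly; the toy word sets below are placeholders for the dictionary you would build from the provided links:

```python
# Toy dictionary; the real word lists would come from the links provided
positive = {'good', 'great', 'happy'}
negative = {'bad', 'angry', 'unhappy', 'sad'}

def sentiment_score(snippet):
    # Lowercasing handles the simple case-change transformation
    words = snippet.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos == neg:               # covers the all-neutral case (avoids 0/0)
        return 0.0
    return (pos - neg) / max(pos, neg)

# Part 2: apply the same scorer to many snippets
snippets = ['What a great and happy day', 'bad service made me angry', 'the sky is blue']
print([sentiment_score(s) for s in snippets])  # [1.0, -1.0, 0.0]
```

Stemming or lemmatizing both the dictionary and the input words would handle the different noun/verb forms the task mentions.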


You can start with the dictionary based on the following links:



Thanks and Regards,


  • sample text.txt
    880 bytes · Views: 0