Welcome to the Simplilearn Community


Big Data Hadoop and Spark Developers | Mar 6,7,13,14,20,21,27,28 Apr 3,4,10,11,17 | Syed Rizvi

Hi sir,

How do I proceed after this step? I am getting this error:

Stderr: VBoxManage.exe: error: The native API dll was not found (C:\Windows\system32\WinHvPlatform.dll) (VERR_NEM_NOT_AVAILABLE).
VBoxManage.exe: error: AMD-V is disabled in the BIOS (or by the host OS) (VERR_SVM_DISABLED)
VBoxManage.exe: error: Details: code E_FAIL (0x80004005), component ConsoleWrap, interface IConsole
Got a similar error while installing Vagrant using the GitHub repo which Syed sir had provided. Eventually I had to edit my BIOS and enable virtualization. This link has good info on it: https://www.partitionwizard.com/partitionmanager/not-in-a-hypervisor-partition.html.
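For anyone on a Linux host who wants to confirm the same thing before rebooting into the BIOS, a quick check (assuming a Linux environment; the Windows case above needs the BIOS/Hyper-V route) is to look for the CPU virtualization flags:

```shell
# Intel VT-x shows up as "vmx", AMD-V as "svm"; an empty result usually
# means hardware virtualization is disabled in the BIOS/UEFI.
grep -E -o 'vmx|svm' /proc/cpuinfo | sort -u
```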
 
Hi Harsh,

Try opening the file in Notepad++. Check if you can see any whitespace / special characters / anything suspicious :)

~Syed
Thanks Syed. I opened the file in Notepad++ and it has carriage return + line feed (CR-LF) at the end of each line, as below. I was able to load successfully into the MySQL table using LINES TERMINATED BY '\r\n' instead of LINES TERMINATED BY '\n'. It seems for Windows files we need '\r\n' and for Unix files '\n' only.

1617729874657.png
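A quick way to convert such a file to Unix line endings instead of changing the LOAD DATA clause (filenames here are placeholders) is to strip the trailing carriage returns:

```shell
# Remove the trailing \r from each line so LINES TERMINATED BY '\n' works
sed 's/\r$//' windows_file.csv > unix_file.csv
```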
 
Excellent Dilip!

Way to go dude...

By any chance, were you able to set up the below repo?


~Syed

I haven't installed Vagrant yet. I found that the Hadoop, Spark, and Scala versions in it are not the latest. I am using Hadoop 3.x, Spark 3.x, and the latest Scala version (independent installation) to get the advantage of their latest features. But surely I will have to take that same Vagrantfile and make some modifications to reflect the latest configuration.
 
Hi Sudhindra,

It seems you have not preprocessed the data.

You need to remove the single quotes (') and double quotes (") from your data before loading it into your dataframe.

The only delimiter you should be using is the comma (,), which you have done above.

Your columns and their values should then no longer show single or double quotes.

~Syed


does the below look right ?

1617737139853.png

Hi Sudhindra. Try this to clean up your file. This might be helpful for others' reference too :)

From the Linux prompt:

Code:
awk 'BEGIN { FS=";"; OFS="," } { gsub("\"", "") } { $1=$1 } 1' dataset_bank_full.csv  > outputfile.csv

The above will clean up your double quotes and give you a comma-separated file; the final result will look like:

1617738252426.png
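For reference, here is the same awk command spread out with comments on what each part does (input/output filenames as in the post above):

```shell
awk '
  BEGIN { FS = ";"; OFS = "," }  # read semicolon-separated, write comma-separated
  { gsub("\"", "") }             # strip all double quotes from the record
  { $1 = $1 }                    # reassign a field so awk rebuilds the record with OFS
  1                              # always-true pattern: print every rebuilt line
' dataset_bank_full.csv > outputfile.csv
```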


inferSchema = true --> this does schema discovery, which will automatically infer your column datatypes; you need not specify them explicitly.

bankDFCsv.printSchema()

|-- age: integer (nullable = true)
|-- job: string (nullable = true)
|-- marital: string (nullable = true)
|-- education: string (nullable = true)
|-- default: string (nullable = true)
|-- balance: integer (nullable = true)
|-- housing: string (nullable = true)
|-- loan: string (nullable = true)
|-- contact: string (nullable = true)
|-- day: integer (nullable = true)
|-- month: string (nullable = true)
|-- duration: integer (nullable = true)
|-- campaign: integer (nullable = true)
|-- pdays: integer (nullable = true)
|-- previous: integer (nullable = true)
|-- poutcome: string (nullable = true)
|-- y: string (nullable = true)
 

masroor.rizvi

Well-Known Member
Trainer
Very good Harsh...way to go..keep working on the Project Use Cases...

~Syed
 

masroor.rizvi

Very good Dilip...helping your fellow learners is a great attitude to have. You have shown leadership characteristics throughout this course. You are definitely going to land such a role in your career (if you don't already have one :))

All the best my friend !

~Syed
 

Thank you Syed. I wish it comes true :) I don't have such a role yet, and I have a lot to learn before I deserve one :)

Now I am seeking your help, or someone's from the forum, for hints on how to solve these questions.

1. Check if marital status mattered for a subscription to deposit:
Since y='yes' implies the customer subscribed to the deposit, the filter contains y='yes'. Do you approve of this, or is something more required here?

bankDFCsv.filter($"y"==="yes").groupBy($"marital".alias("Marital Status")).agg(count($"y").alias("Subscribed count")).show
+--------------+----------------+
|Marital Status|Subscribed count|
+--------------+----------------+
|      divorced|             622|
|       married|            2755|
|        single|            1912|
+--------------+----------------+
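As an aside, the same filter + group-by + count can be cross-checked outside Spark with a small awk sketch on the cleaned comma-separated file from earlier. The column positions (marital in column 3, y in the last column) are an assumption about the dataset layout:

```shell
# Count subscribed customers (y == "yes") per marital status;
# assumes a header row, marital in field 3, y in the last field.
awk -F',' 'NR > 1 && $NF == "yes" { cnt[$3]++ }
           END { for (m in cnt) print m, cnt[m] }' outputfile.csv
```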

2. Check if age and marital status together mattered for a subscription to deposit scheme:

NEED HELP

3. Do feature engineering for the bank and find the right age effect on the campaign:

NEED HELP
 
Hi Syed,

I am working on project 1 (Stock Exchange Data Analysis) and need some help with the following last 2 data analysis tasks:
4) Show the best-growing industry for each state having at least two or more industries mapped.
5) For each sector, find the following:
  • Worst year
  • Best year
  • Stable year
I am not able to understand these 2 questions, as the source data is based on stock prices, and I am not sure whether the questions are about the performance of the stock or something else. Can you please help/guide me with these 2 questions?
 

masroor.rizvi

Dear Learners,

One of the action items from my last weekend session.

An example of StructType in Spark SQL...please go through the link below..


Many Thanks,
Syed
 

masroor.rizvi

Hello Anik @Anik Chakraborty_1

As per my action item on repartition, please have a look at the "Real World Example" section of the link below.


The use case here shows how repartitioning works in real time

Happy Learning :)

~Syed
 

Ahamika Banerjee

Active Member
Hello Ahamika,

Your screenshots tell me the following:

1. JDK is installed on your system
2. Eclipse is also installed on your system at the default workspace

Can you please let us know what exactly you are doing (step by step if possible), with screenshots of the actual issues you are facing?

~Syed
Thank you sir. My problem has been resolved and I was able to run all the commands. Though I had a few difficulties while running them, the community forum's questions and replies helped me a lot. I have completed half of my homework, and half of it is left. Thanks a lot, sir. Actually, I am completely new to this field, in fact to the IT industry, as I haven't worked before, hence I am taking a bit more time to grasp new things.
 

masroor.rizvi

Hello Harsh,

Here is a thought (I'm supposed to give only hints for the project, not the solution :) ).

- Volume × Price could be a good metric to consider for stability
- Calculate the above (you might have to join tables) and group by sector
- The top 2 should be your answer to 4.
- Group by year. That should give you the best and worst years.

~Syed
 

masroor.rizvi

great going Ahamika..keep practicing..
 

masroor.rizvi


Example 1 (A Simple Example on local machine)

Producer:

1618114763509.png

consumer:
1618114781315.png

Some playing around - 1:

producer:

1618115328475.png

Consumer 1 (partition 0) :

1618115373765.png

Consumer 2 (partition 1) :

1618115426534.png

Some playing around - 2

Producer with separator explicitly defined

1618119056853.png

Consumer 1 (partition 0) : key1 is published here

1618119023347.png

Consumer 2 (partition 1) : key2 is published here

1618119080085.png

Trying with property print.key = true:

$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TestTopic2 --property "parse.key=true" --property "key.separator=:" --property print.key=true

1618119521756.png
 


masroor.rizvi


great Dilip..any luck setting up the multi broker ? Anyone else ?
 

@masroor.rizvi

Hi Syed.

I ran into an issue while setting up multi broker (Kafka). It seems the link is not available. Please provide your assistance

> vagrant up
==> vagrant: A new version of Vagrant is available: 2.2.15 (installed version: 2.2.14)!
==> vagrant: To upgrade visit: https://www.vagrantup.com/downloads.html

Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'manning/spark-in-action' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
The box 'manning/spark-in-action' could not be found or
could not be accessed in the remote catalog. If this is a private
box on HashiCorp's Vagrant Cloud, please verify you're logged in via
`vagrant login`. Also, please double-check the name. The expanded
URL and error message are shown below:

URL: ["https://vagrantcloud.com/manning/spark-in-action"]
Error: The requested URL returned error: 404 Not Found
 

masroor.rizvi


Hey Dilip,

This is not the multi broker set up, I'm afraid. This is the Real Time Dashboard case study.

Further, can you do a "pwd" and an "ls -l" of the dir from where you are doing vagrant up?

~Syed
 
Hi Syed,

My bad, you are right. I misunderstood and confused partitions with brokers. I thought one of your homework items was a multi-broker setup, but that one is a single-node setup with 2 partitions (2.0 - on the same cluster, create a topic with 2 partitions).

Here is the multi-broker setup.

Two nodes (node 1: ZooKeeper and broker 1; node 2: broker 2).

I retained zookeeper.connect listening on port 2181 on both nodes, with different broker IDs.

Machine 1 (hdc@m1):

server.properties:
broker.id=1
port=9092
num.partitions=2
zookeeper.connect=localhost:2181,m2:2181
log.dirs=/tmp/kafka-logs-1

zookeeper.properties:
initLimit=5
syncLimit=2
server.1=m1:2666:3666
server.2=m2:2667:3667


1618204551979.png

Machine 2:

hdc@m2: server.properties -> (broker.id=2,port=9093) , num.partitions=2, zookeeper.connect=localhost:2181, m2:2181, log.dirs=/tmp/kafka-logs-2

zookeeper.properties file ->

initLimit=5
syncLimit=2
server.1=m1:2666:3666
server.2=m2:2667:3667


1618204572271.png


Please let me know if the config looks right for the multi-node, multi-broker setup.
 
April 10 - H.W. 02

Kafka cluster = 1
Producers = 2
Consumers = 2
Partitions = 2

The image shows that messages with the same key go to the same consumer, irrespective of which producer they are sent from:
P1K1 sent from either Producer 1 or Producer 2 goes to Consumer 2.
P2K2 sent from either Producer 1 or Producer 2 goes to Consumer 1.
2021_04_12_22_57_32_Consumer2.png
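The behaviour above is what Kafka's default key-based partitioning gives you: partition = hash(key) mod num_partitions, so a given key always maps to the same partition and hence the same consumer. A toy shell illustration of the idea (cksum is only a stand-in for Kafka's murmur2 hash, so the actual partition numbers will differ from a real broker's):

```shell
num_partitions=2
for key in P1K1 P2K2; do
  # hash the key and map it onto one of the partitions; the same key
  # always produces the same hash, hence the same partition
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  echo "$key -> partition $((hash % num_partitions))"
done
```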
 