
WIPRO - BIGDATA ACADEMY

Dhrub K Thakur

Member
Customer
The Cloudera server is running, but I am getting the below error in the log file.

ERROR ScmActive-0:com.cloudera.server.cmf.components.ScmActive: ScmActive : Unable to retrieve non-local non-loopback IP address. Seeing address: localhost/127.0.0.1
 

Kasim Khan H

Member
Customer
Hi,

I created a DataFrame for project 1:

Code:
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("project1.csv")

I tried to read the data:

Code:
df.select("y").show()

Got the below error:

org.apache.spark.sql.AnalysisException: cannot resolve 'y' given input columns age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y";

Please assist.
Hi Ganesh,

I faced the same issue. It can be caused by the dataset format: the file is semicolon-delimited, so the reader treated the entire first line as a single header column, and the DataFrame could not infer the column names properly. When the DataFrame is created correctly, the response should look like the below, giving a proper table-like DataFrame:

" df: org.apache.spark.sql.DataFrame = [age: string, job: string, marital: string, education: string, default: string, balance: string, housing: string, loan: string, contact: string, day: string, month: string, duration: string, campaign: string, pdays: string, previous: string, poutcome: string, y: string] "

Here is what I did to solve the issue:

1. Open the data set in Excel.
2. Select all rows -> Data tab -> Text to Columns -> select Delimited -> click Next -> check Semicolon -> click Finish.
3. Save the file as someOtherName.csv.
4. Repeat the steps you used to create the DataFrame and do a select on it with a column name. This should work.
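Alternatively, since the spark-csv reader is already in use here, you can skip Excel and tell the reader the file is semicolon-delimited. A minimal sketch, assuming the same project1.csv file:

Code:
// tell the reader the file uses ';' as the separator
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  option("delimiter", ";").
  load("project1.csv")

df.select("y").show()  // 'y' should now resolve as its own column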

Hope this helps.
 

Kasim Khan H

Member
Customer
HI Manish,

We are working on it; this will be resolved in 24 hrs. The new data set will be available with the course project data sets.

Hi Deshdeep,

I'm wondering what needs to be done for project 2 on the K-means algorithm. There is no data set. Could you please elaborate on what needs to be done here? Thanks in advance.
 

Ganesh Padaiyachi

Member
Customer
Hi Ganesh,

I faced the same issue. It can be caused by the dataset format: the file is semicolon-delimited, so the reader treated the entire first line as a single header column, and the DataFrame could not infer the column names properly. When the DataFrame is created correctly, the response should look like the below, giving a proper table-like DataFrame:

" df: org.apache.spark.sql.DataFrame = [age: string, job: string, marital: string, education: string, default: string, balance: string, housing: string, loan: string, contact: string, day: string, month: string, duration: string, campaign: string, pdays: string, previous: string, poutcome: string, y: string] "

Here is what I did to solve the issue:

1. Open the data set in Excel.
2. Select all rows -> Data tab -> Text to Columns -> select Delimited -> click Next -> check Semicolon -> click Finish.
3. Save the file as someOtherName.csv.
4. Repeat the steps you used to create the DataFrame and do a select on it with a column name. This should work.

Hope this helps.

Thank you so much Kasim. Will try that.
 

Kasim Khan H

Member
Customer
Hi All,

Can somebody explain this task from the banking project?

"4. Check quality of customers by checking average balance, median balance of customers"

I am not sure what to do for this. Do I just have to calculate the average and median balance of the customers?
 

adarsh pattar

Member
Customer
Hi All,

Can somebody explain this task from the banking project?

"4. Check quality of customers by checking average balance, median balance of customers"

I am not sure what to do for this. Do I just have to calculate the average and median balance of the customers?

The way I understood it: first calculate the average and median balance from the balance column, then classify the customers by counting how many have a balance above and below the average. If 50% or more of the customer base has a balance above the calculated average, we may label the customer base '1'; otherwise '-1'.
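A minimal sketch of the first half in the DataFrame API, assuming the semicolon-delimited file has already been loaded into a DataFrame df with an integer balance column:

Code:
import org.apache.spark.sql.functions.avg

// average balance across all customers
val avgBalance = df.agg(avg("balance")).first().getDouble(0)

// how many customers sit above the average
val above = df.filter(df("balance") > avgBalance).count()
val total = df.count()

// label the customer base '1' if 50% or more are above average, else '-1'
val label = if (above.toDouble / total >= 0.5) 1 else -1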
 

Ravi_272

Member
Customer
Deshdeep/Shivank,

There are a couple of questions in the simulation test where some data format is referred to, but the format is not given in the question. Can you please take a look at that and at the missing information?
 

Ravi_272

Member
Customer
Deshdeep/Shivank,

Question #10 in the simulation test refers to a text file in CloudLab, but I am not able to access the file.

/user/simplilearn/Question37/File.txt

Can you please provide more details about it?
 

Kasim Khan H

Member
Customer
The way I understood it: first calculate the average and median balance from the balance column, then classify the customers by counting how many have a balance above and below the average. If 50% or more of the customer base has a balance above the calculated average, we may label the customer base '1'; otherwise '-1'.


Thanks Adarsh.

Could you let me know how to calculate the median in Spark? I am having a hard time finding the median.

Thanks again.
 

adarsh pattar

Member
Customer
Thanks Adarsh.

Could you let me know how to calculate the median in Spark? I am having a hard time finding the median.

Thanks again.

Since the number of entries in the "balance" column is odd, the median is the middle (center) value of the sorted entries of the balance column (as per the standard definition of the median).
 

MEGHA AHLUWALIA

Member
Customer
Hi Deshdeep,

I was trying one of the simulation tests and could not find quick.txt or the employee folder in user/simplilearn.

Can you please check and let me know.

Thanks,
 

Kasim Khan H

Member
Customer
Since the number of entries in the "balance" column is odd, the median is the middle (center) value of the sorted entries of the balance column (as per the standard definition of the median).

I understand that, but I only got as far as sorting the DataFrame on the balance column; reading the middle row has been a difficult task using a DataFrame or an RDD.
 

Ramu Umesh Pekala

Member
Customer
Dear Friends,

Does anyone know about these attributes given in project 1 and what they are used for?

# social and economic context attributes
16 - emp.var.rate: employment variation rate―quarterly indicator (numeric)
17 - cons.price.idx: consumer price index―monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index―monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate―daily indicator (numeric)
20 - nr.employed: number of employees―quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the customer subscribed a term deposit? (binary: 'yes', 'no')
 

Kasim Khan H

Member
Customer
You can use the zipWithIndex function after sorting; then you can directly look up the middle row value by its index.

I submitted this project already. I tried zipWithIndex and failed with it, though. :(
Thank you so much for the help!
One last question: has the k-means algorithm already been taught?
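For anyone still stuck on this step, a minimal sketch of the zipWithIndex approach, assuming a DataFrame df whose balance column was inferred as an integer (use getLong or getDouble if your schema differs):

Code:
// pull the balance column out as an RDD and sort it
val balances = df.select("balance").rdd.map(_.getInt(0))
val indexed = balances.sortBy(identity).zipWithIndex().map(_.swap)

// the median is the middle element of the sorted values
// (this handles an even count too, although this dataset's count is odd)
val n = indexed.count()
val median =
  if (n % 2 == 1) indexed.lookup(n / 2).head.toDouble
  else (indexed.lookup(n / 2 - 1).head + indexed.lookup(n / 2).head) / 2.0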
 

Kasim Khan H

Member
Customer
Dear Friends,

Does anyone know about these attributes given in project 1 and what they are used for?

# social and economic context attributes
16 - emp.var.rate: employment variation rate―quarterly indicator (numeric)
17 - cons.price.idx: consumer price index―monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index―monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate―daily indicator (numeric)
20 - nr.employed: number of employees―quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the customer subscribed a term deposit? (binary: 'yes', 'no')

Attributes 16-20 are of no use as far as the tasks for this project are concerned, but "y" is the key column for most of the tasks in this project.
 

adarsh pattar

Member
Customer
I submitted this project already. I tried zipWithIndex and failed with it, though. :(
Thank you so much for the help!
One last question: has the k-means algorithm already been taught?

zipWithIndex worked for me. :) Yes, the k-means clustering algorithm has been explained in class. Have they not updated the second project data set yet?
 

Kasim Khan H

Member
Customer
zipWithIndex worked for me. :) Yes, the k-means clustering algorithm has been explained in class. Have they not updated the second project data set yet?

I missed the last couple of classes. Was this algorithm explained in the last two classes? Any insight into this would be really helpful.
I see a data set with project 2, but it does not seem to be the right data.
 

adarsh pattar

Member
Customer
I missed the last couple of classes. Was this algorithm explained in the last two classes? Any insight into this would be really helpful.
I see a data set with project 2, but it does not seem to be the right data.

OK, he is explaining it again right now in the doubt-clearing session; I hope you are attending it. It started at 4 o'clock itself.
 

Arun_4681

Member
Customer
Does anybody have an idea how to solve this question?

4. Check quality of customers by checking average balance, median balance of customers

We need to find the average balance, and then what do we need to do?

Thanks in advance.
 

VATTIPALLI ARUNA

Active Member
Customer
Did anyone get the new data for project 2? I see no new notifications or updates of any sort.

If anyone has the new data, please share a sample of it.
 

adarsh pattar

Member
Customer
Did anyone get the new data for project 2? I see no new notifications or updates of any sort.

If anyone has the new data, please share a sample of it.

Yes, the new data set is available in the Projects folder under the Downloads tab. You have to download the whole Projects folder again; you will get the new data set under the Project 2 folder.
 

Arun_4681

Member
Customer
Hi All,

I am not able to upload the second project data set to my HDFS folder; I am only able to upload 5 MB of data. Has anyone successfully moved the full 40 MB file to HDFS?
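If the size limit comes from the web upload form, one workaround (an assumption, since it depends on the lab setup) is to copy the file from the lab's local file system with the HDFS command line:

Code:
# hypothetical paths; adjust the user and file names to your own
hdfs dfs -put Project2_dataset.csv /user/your.name_wipro/project2/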
 

praveen.rachapally

Member
Customer
Hi,

This query is regarding MapReduce programs:
1. How can we define the number of partitions?
2. Let's say we have two reducers (r1, r2) defined; can we force the part of the data present in a file split to be processed only by r1? Please explain.
3. When does the combiner come into the picture and process the data?

file -> file split -> mapper -> partition -> merge -> sort -> group by -> reducer -> output
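Not an official answer, but a sketch against the Hadoop Java API, written in Scala to match the rest of this thread; the class name and the key test are hypothetical. The number of reduce tasks is set on the Job, a custom Partitioner decides which reducer a key goes to, and a combiner (if set) runs on each mapper's output before the shuffle:

Code:
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Partitioner}

// route keys starting with "a" to reducer 0 (r1); everything else to r2
class RouteToR1 extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int =
    if (key.toString.startsWith("a")) 0 else 1
}

val job = Job.getInstance()
job.setNumReduceTasks(2)                    // question 1: two reduce tasks
job.setPartitionerClass(classOf[RouteToR1]) // question 2: pin keys to r1
// question 3: job.setCombinerClass(...) makes the combiner run on each
// mapper's output before the shuffle, cutting the data sent to the reducers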
 

Srinivasulu Kuruva

Member
Customer
Not able to load a CSV file with the load API; I tried both:

scala> val cars = sqlContext.load("/user/manish.pundir_wipro/spark/emp.csv", "com.databricks.spark.csv")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org

scala> val cars = sqlContext.load("/user/manish.pundir_wipro/spark/emp.csv", "csv")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.ClassNotFoundException: Failed to find data source: csv. Please find packages at http://spark-packages.org
Try it like this:

val data = sc.textFile("FilePath" + "/FileName.csv");

Example:

val data = sc.textFile("/home/srinivasulu.kuruva_wipro" + "/Project1_dataset_bank-full.csv");
 

DeshDeep Singh

Well-Known Member
Simplilearn Support
Alumni
Hey guys (from batches 1 & 2),

I have gone through the projects submitted to date. Most of them look strong, with some great descriptions; the rest are short and crisp with the necessary information.

Only 3 projects need improvement, and I have already sent a mail to those members personally. Please check and share your updated projects with us soon.
 

adarsh pattar

Member
Customer
Hi Deshdeep, in simulation exam 4, for question number 5, the given dataset/path does not exist. Please help. Below is the given path to access the dataset: "/user/Simplilearn/2013-09-15.log"
 

Megha_42

Well-Known Member
Simplilearn Support
Hi All,

It's great to see all of you so immersed in the course!
We've been through intensive hands-on work all week. Let this weekend prove to be the bridge to your mastery!
Here are your goals for the weekend:

Flume
-----
1. Get Twitter Messages
2. Network Statistics -> Spooldir/netstats

Pig
------
Top 20 Most frequent words having at least 4 characters

Spark
-----
Go through Scala basics
Spark: Basic operations on RDD

And don't forget to keep in touch with the basics: cover your e-learning videos!

Have a wonderful weekend.

Regards
Megha
 

praveen.rachapally

Member
Customer
object MarketingAnalysis {
  case class Customer(age: Int, job: String, marital: String, education: String, default: String, balance: Int, housing: String, loan: String, contact: String, day: Int, month: String, duration: Int, campaign: Int, pdays: Int, previous: Int, poutcome: String, y: String)

  def runLoadDataCreateDF(): Unit = {
    System.out.println("inside runLoadDataCreateDF method");
    import spark.implicits._
    val customersdata = sc.textFile("/user/praveen.rachapally_wipro/project1_bank/Project1_dataset_bank-full.csv");
    // customersdata.saveAsTextFile("/user/praveen.rachapally_wipro/project1_bank/customersdata416");
    val header = customersdata.first();
    val customers = customersdata.filter(row => row != header);
    val customersDF = customers.map(_.replace("\"", "")).map(_.split(";")).map(cust => Customer(cust(0).trim.toInt, cust(1).trim, cust(2).trim, cust(3).trim, cust(4).trim, cust(5).trim.toInt, cust(6).trim, cust(7).trim, cust(8).trim, cust(9).trim.toInt, cust(10).trim, cust(11).trim.toInt, cust(12).trim.toInt, cust(13).trim.toInt, cust(14).trim.toInt, cust(15).trim, cust(16).trim)).toDF();
    customersDF.show(10);
  }

  def main(args: Array[String]) {
    System.out.println("inside main method");
    runLoadDataCreateDF();
    System.out.println("main method END");
  }
}

I copy-pasted the above code into CloudLab and ran the program as shown below, but I am getting an error. Any idea how to resolve this issue?

scala> MarketingAnalysis.main(Array())

inside main method
inside runLoadDataCreateDF method
org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
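A common cause of "Task not serializable" in the shell (a guess here, since the full stack trace is cut off) is that the closures passed to map and filter reference the enclosing object, so Spark tries to serialize the whole object. Two things usually help: define the case class at the top level, outside the object, and keep the transformations free of references to non-serializable state. A trimmed, hypothetical sketch with three columns only, pasted directly into spark-shell where sc and sqlContext are predefined:

Code:
// top-level case class, so the closures only capture serializable values
case class Customer(age: Int, job: String, y: String)

import sqlContext.implicits._

val customersdata = sc.textFile("/user/praveen.rachapally_wipro/project1_bank/Project1_dataset_bank-full.csv")
val header = customersdata.first()
val customersDF = customersdata.filter(_ != header).
  map(_.replace("\"", "").split(";")).
  map(c => Customer(c(0).trim.toInt, c(1).trim, c(16).trim)).
  toDF()

customersDF.show(10)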
 

praveen.rachapally

Member
Customer
How can we run the command shown below to compile a Spark/Scala program in CloudLab?
$ scalac -classpath "spark-core_2.10-1.3.0.jar:/usr/local/spark/lib/spark-assembly-1.4.0-hadoop2.6.0.jar" SparkPi.scala
 

Ramu Umesh Pekala

Member
Customer
Dear Friends,

I hope some of you have completed project 1. I was not clear on the below requirements; can you please help me understand them?

5. Check if age matters in marketing subscription for deposit
6. Check if marital status mattered for subscription to deposit
7. Check if age and marital status together mattered for subscription to deposit scheme
8. Do feature engineering for the age column and find the right age effect on campaign
 

Arun_4681

Member
Customer
Dear Friends,

I hope some of you have completed project 1. I was not clear on the below requirements; can you please help me understand them?

5. Check if age matters in marketing subscription for deposit
6. Check if marital status mattered for subscription to deposit
7. Check if age and marital status together mattered for subscription to deposit scheme
8. Do feature engineering for the age column and find the right age effect on campaign

Deshdeep already answered this:
1. We need to check how many people of a particular age group said yes to the deposit: get each age and the count of people of that age who said yes.
2. Same as above, but with marital status.
3. Same as above, but with age and marital status together.

8. You need to find which age group has more people who said yes to the deposit.
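A minimal sketch of that pattern in the DataFrame API, assuming the DataFrame df built earlier in this thread (swap "age" for "marital", or group by both columns, for tasks 6 and 7):

Code:
// count "yes" subscriptions per age group (task 5)
df.filter(df("y") === "yes").
  groupBy("age").
  count().
  orderBy("age").
  show()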
 