# Spark: Calculating percent, mean, median, etc.

Discussion in 'Big Data and Analytics' started by Vaughn Shideler, Jun 19, 2017.

1. ### Vaughn Shideler Member

I'm a bit stumped on project 1, the one with the Portuguese bank. I don't think we were ever taught in the training materials how to do some of the required operations. I start off by loading the file into Spark:

val input = sc.textFile("/user/vshideler_gmail/project_1_data.csv")

Performing "input.count()" shows that it is reading the file properly.

Then I create a case class named "Bank" to define the schema:

case class Bank(age:Int, job:String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:String, month:String, duration:Int, campaign:Int, pdays:String, previous:Int, poutcome:String, y:String)

Then I split each line on the semicolon delimiter:

val input_split = input.map(line => line.split(";"))

And create an RDD of Bank objects from the split data:

val bankrdd = input_split.map(x => Bank(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7), x(8), x(9), x(10), x(11).toInt, x(12).toInt, x(13), x(14).toInt, x(15), x(16)))
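
(One caveat: if the CSV file still contains its header row, the toInt calls will throw a NumberFormatException on the column names, so the header would need to be dropped before parsing, something like this:

val header = input.first() // the column-name row
val data = input.filter(line => line != header) // everything except the header

and then build input_split and bankrdd from data instead of input.)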

Then create a data frame:

val bankDF = bankrdd.toDF()
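
(The toDF() call works in the spark-shell because the SQL implicits are already imported there; in a standalone application, as far as I know, you would need something like

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._ // brings toDF() into scope for RDDs of case classes

before the conversion.)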

So far, so good. Now the challenging part begins. It's necessary to find out the marketing success rate. By this, I assume that I need to find the "success" entries under "poutcome" and compare them to the total number of entries. So I filter the successes into their own data frame and divide their count by the total count of bankDF:

val success = bankDF.filter($"poutcome" === "success")

val successDF = success.toDF()

val k = bankDF.count()

val z = successDF.count()

val x = z.toDouble / k
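
(Condensed into a single expression, and with the counts converted to Double so the division isn't truncated to an integer, I believe the same idea looks like this:

val successRate = bankDF.filter($"poutcome" === "success").count().toDouble / bankDF.count() * 100

which gives the rate as a percentage.)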

I'm hoping that this is correct. I don't know whether you can check it, since it has to do with the project answers, but could you at least confirm that the approach is right?

The next problem is to figure out the "maximum, mean, and minimum age of average targeted customer" and also "check...average balance, median balance of customers."

This is something I don't know how to do, and I'm fairly sure we were never taught it. I know that Spark includes functions for minimums, maximums, means, medians, and so on, but I'm not sure how to use them. Please assist.

#1

3. ### Vaughn Shideler Member

Can a support person please look into this?

#3
4. ### Megha_42 Well-Known Member Simplilearn Support

Hi Vaughn,

Thank you for reaching out, and sorry about the delay. Here's how you can think about the problem.
In a DataFrame, aggregations like min, max, and average are applied to groups of rows, so you normally group the rows first with groupBy and then apply the aggregation. If you want an aggregate over an entire column, you simply aggregate without any grouping columns, which treats the whole DataFrame as a single group.

Here is an example:

val df = sc.parallelize(Seq((1.0,0.3,1.0),(1.0,0.5,0.0),(-1.0,0.6,0.5),(-1.0,5.6,0.2))).toDF("col1","col2","col3")

df.groupBy($"col1").min().show

You could also pass a list of columns, like so:

df.groupBy("col1").sum("col2", "col3")

Here's a link with more examples:
https://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column

All the very best!

#4
5. ### Natarajan Murugadoss(2842) Member

I used the following:

bankRDD.registerTempTable("PortuBank")
val avg_median_bal = sqlContext.sql("Select avg(bal) as Avg_Balance , median(bal) as Median_balance from PortuBank")

but it does not work.
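
As far as I can tell, it fails because Spark SQL has no built-in median() function (and registerTempTable is defined on the DataFrame, not the RDD). Assuming Spark 2.1+ (earlier versions need Hive support for percentile_approx) and the balance column from the schema above, a sketch along these lines should compute the same thing:

bankDF.registerTempTable("PortuBank")
val stats = sqlContext.sql("SELECT avg(balance) AS Avg_Balance, percentile_approx(balance, 0.5) AS Median_Balance FROM PortuBank")
stats.show

// or, without SQL, via the DataFrame stat functions (Spark 2.0+):
val Array(medianBalance) = bankDF.stat.approxQuantile("balance", Array(0.5), 0.0)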

#5
