Spark: Calculating percent, mean, median, etc.

Discussion in 'Big Data and Analytics' started by Vaughn Shideler, Jun 19, 2017.

  1. Vaughn Shideler

    Joined:
    Mar 13, 2017
    Messages:
    7
    Likes Received:
    0
    I'm a bit stumped on project 1, the one with the Portuguese bank. I don't think we were ever taught in the training materials how to do some of the required operations. I start off by loading the file into Spark:

    val input = sc.textFile("/user/vshideler_gmail/project_1_data.csv")

    Performing "input.count()" shows that it is reading the file properly.

    Then I create a class named "Bank" to define the schema:

    case class Bank(age:Int, job:String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:String, month:String, duration:Int, campaign:Int, pdays:String, previous:Int, poutcome:String, y:String)

    Then split the file using semicolon separation:

    val input_split = input.map(line => line.split(";"))

    And create an rdd from the data:

    val bankrdd = input_split.map(x => Bank(x(0).toInt, x(1), x(2), x(3), x(4), x(5).toInt, x(6), x(7), x(8), x(9), x(10), x(11).toInt, x(12).toInt, x(13), x(14).toInt, x(15), x(16)))
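    One thing worth checking here (an assumption on my part, since it depends on the exact file): the bank CSV usually ships with a header row and double-quoted string fields, in which case `x(0).toInt` will throw on the header line and the quotes will stick to the values. A plain-Scala sketch of a parse step that strips both:

    ```scala
    // Hypothetical sample lines: a header plus one data row, shaped like
    // the semicolon-separated bank file (an assumption, not the real data).
    val lines = Seq(
      "\"age\";\"job\";\"balance\"",
      "\"58\";\"management\";\"2143\""
    )

    // Drop the header row, then strip surrounding double quotes from each field.
    val header = lines.head
    val parsed = lines
      .filter(_ != header)
      .map(_.split(";").map(_.stripPrefix("\"").stripSuffix("\"")))

    val age = parsed.head(0).toInt // now safe to convert
    ```

    With the RDD the same idea is `val header = input.first()` followed by `input.filter(_ != header)` before the split/map steps.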


    Then create a data frame:

    val bankDF = bankrdd.toDF()

    So far, so good. The challenging part begins. It's necessary to find out the marketing success rate. By this, I assume that I need to find the "success" entries under "poutcome" and compare them to the total number of entries. So I filter the successes, assign them to a data frame and divide them by the bankDF data frame:

    val success = bankDF.filter($"poutcome" === "success") // filter() already returns a DataFrame, no extra toDF() needed

    val k = bankDF.count()

    val z = success.count()

    val x = z.toDouble / k // success rate = successes over total; toDouble avoids integer division


    I'm hoping this is correct. I don't know if you can verify it, since it has to do with the project answers, but can you at least confirm that the approach is right?
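    One pitfall with the division step, in case anyone hits it: `count()` returns a `Long`, and dividing two `Long`s truncates toward zero, so a rate like 3% comes out as 0. Converting one side to `Double` first avoids that. A minimal sketch with made-up counts (not the project's actual numbers):

    ```scala
    // Made-up counts standing in for bankDF.count() and success.count().
    val total: Long = 45211L
    val successes: Long = 1511L

    val truncated = successes / total     // Long division truncates to 0
    val rate = successes.toDouble / total // fractional success rate
    ```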

    The next problem is to figure out the "maximum, mean, and minimum age of average targeted customer" and also "check...average balance, median balance of customers."

    This is something I don't know how to do, and I'm pretty sure we were never taught it. I know that Spark includes functions for min, max, mean, median, etc., but I'm not sure how to use them. Please assist.
     
    #1
  2. Vaughn Shideler

    Could someone address this, please?
     
    #2
  3. Vaughn Shideler

    Can a support person please look into this?
     
    #3
  4. Megha_42

    Megha_42 Well-Known Member
    Simplilearn Support

    Joined:
    Dec 15, 2016
    Messages:
    206
    Likes Received:
    9
    Hi Vaughn,

    Thank you for reaching out, and sorry about the delay. Here's how you can think about the problem.
    In a dataset, aggregations like max, min, and avg are applied to a group of rows, so you normally group the DataFrame first. If you want to aggregate over entire columns, call groupBy() with no arguments (or agg directly on the DataFrame) so that the whole DataFrame forms a single group.
    After grouping you can apply operations like min, max, and average.

    Here is an example,

    val df = sc.parallelize(Seq((1.0,0.3,1.0),(1.0,0.5,0.0),(-1.0,0.6,0.5),(-1.0,5.6,0.2))).toDF("col1","col2","col3")

    df.groupBy($"col1").min().show

    You can also pass a list of columns, like so:

    df.groupBy("col1").sum("col2", "col3").show

    Here's a link with more examples:
    https://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column
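    Applied to the bank example (assuming the column names from the case class earlier in the thread), aggregating the whole DataFrame with no grouping key would look roughly like this — a sketch for the spark-shell, not verified against the project data:

    ```scala
    import org.apache.spark.sql.functions.{min, max, avg}

    // groupBy() with no columns puts the entire DataFrame into a single
    // group, so the aggregates run over every row.
    bankDF.groupBy().agg(
      min("age"),    // minimum age of targeted customers
      max("age"),    // maximum age
      avg("age"),    // mean age
      avg("balance") // average balance
    ).show
    ```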

    All the very best!
     
    #4
  5. Natarajan Murugadoss(2842)

    Joined:
    Nov 15, 2014
    Messages:
    2
    Likes Received:
    1
    I used the following:
    bankRDD.registerTempTable("PortuBank")
    val avg_median_bal = sqlContext.sql("Select avg(bal) as Avg_Balance , median(bal) as Median_balance from PortuBank")
    but it does not work.
     
    #5
  6. _12308

    _12308 New Member

    Joined:
    Sep 15, 2017
    Messages:
    1
    Likes Received:
    0
    Hi Natarajan,

    median() is an Oracle Database function. In the Spark shell we are running HiveQL, and Hive does not provide a median() function. (MySQL only serves as the Hive metastore database here; it doesn't execute the queries.)

    Instead of median(bal), use percentile(bal, 0.5) to get the median. To learn more about percentiles as a concept, a quick search will help; for the percentile function used here, refer to https://docs.treasuredata.com/articles/hive-aggregate-functions

    Once you understand what a percentile is, you'll see what this expression is doing. I did a bit of research on median functions because I was stuck just like you and searching for answers. Hope this helps.
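    To make concrete what percentile(bal, 0.5) computes, here's a plain-Scala sketch of the idea on a made-up balance list (nearest-rank style; Hive's percentile interpolates for even-sized data, so treat this as an illustration of the concept, not Hive's exact algorithm):

    ```scala
    // Made-up balances; in the project these would come from the balance column.
    val balances = Seq(2143, 29, 2, 1506, 231, 447, 2, 121, 593)

    def percentile(xs: Seq[Int], p: Double): Double = {
      val sorted = xs.sorted
      // Nearest-rank: position of the p-th percentile in the sorted data.
      val idx = math.ceil(p * sorted.length).toInt - 1
      sorted(math.max(idx, 0)).toDouble
    }

    val median = percentile(balances, 0.5) // middle value of the 9 items
    val mean = balances.sum.toDouble / balances.length
    ```

    In Spark SQL that becomes sqlContext.sql("select avg(balance), percentile(balance, 0.5) from PortuBank") — assuming the column is named balance, as in the case class earlier in the thread, rather than bal.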
     
    #6
