I have been trying to filter data of a CSV datafra...

Discussion in 'Big Data and Analytics' started by TEJAS KUNDU, May 21, 2017.

  1. TEJAS KUNDU

    TEJAS KUNDU New Member

    Joined:
    Jan 9, 2017
    Messages:
    1
    Likes Received:
    0
    I have been trying to filter data in a CSV DataFrame in Spark 1.6.0, but I have been unsuccessful in all my attempts.
    I have used the com.databricks package to load a CSV dataset into a DataFrame, and I have been working on the Banking Project. I have tried querying the rows where "y" equals "yes", but it always throws a "column not found" exception. Please help me solve this problem.
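    For reference, a minimal sketch of the kind of commands involved (the file name and options here are assumptions, based on later posts in this thread):

    // Spark 1.6 with the spark-csv (com.databricks) package
    val df = sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ";")
      .option("inferSchema", "true")
      .load("mkt_campaign.csv")

    // This is the step that fails: because the header row is malformed,
    // the parsed column name is not a plain "y", so Spark reports
    // "cannot resolve 'y' given input columns"
    val yes = df.filter(df("y") === "yes")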
     
    #1
  2. Megha_42

    Megha_42 Well-Known Member
    Simplilearn Support

    Joined:
    Dec 15, 2016
    Messages:
    206
    Likes Received:
    9
    Hi Tejas,

    Thank you for reaching out.
    Glad that you have been trying out the Banking project and that you have loaded the data using the databricks package.

    Could you kindly share the load command and the other commands you've tried, so that we know the structure of the RDD/DataFrame in Spark and can guide you accordingly?
    Screenshots would be even more helpful.

    Thanks and Regards
    Megha
     
    #2
  3. Karthik Shivana

    Karthik Shivana Moderator
    Simplilearn Support Alumni

    Joined:
    Apr 1, 2016
    Messages:
    688
    Likes Received:
    32

    Hi Tejas,

    Please change the file format and then try again, e.g. rename it to .txt and load it as plain text.
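    For example, a minimal sketch of loading it as plain text in Spark 1.6 (the file name is an assumption):

    // Read the file as plain text, bypassing the CSV parser entirely
    val lines = sc.textFile("mkt_campaign.txt")

    // Drop the quoted header row and split each line on ";"
    val rows = lines.filter(line => !line.startsWith("\"age\""))
      .map(line => line.split(";").map(_.replaceAll("\"", "")))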

    Please let us know if you still face any issues with this.

    Regards
    Karthik
     
    #3
  4. Somashekhar (3400)

    Somashekhar (3400) Active Member

    Joined:
    Oct 7, 2013
    Messages:
    17
    Likes Received:
    1
    Hi Tejas,
    I have also run into the same issue, and I found that the file format is bad; you just need to correct the file. I did it using Google Drive. If you observe the file, you'll see the column names are wrapped in stray quotes ("). If you still need help, let me know your email ID and I will send you the corrected file, which you can then use. Thanks. (A clean-up sketch is below.)
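    For instance, a minimal sketch of rewriting the file with the stray quotes stripped (the file names are assumptions):

    import java.nio.file.{Files, Paths}
    import scala.collection.JavaConverters._

    // Read the original file, strip the stray double quotes,
    // and write a cleaned copy for the CSV loader
    val cleaned = Files.readAllLines(Paths.get("mkt_campaign.csv")).asScala
      .map(_.replaceAll("\"", ""))
    Files.write(Paths.get("mkt_campaign_clean.csv"), cleaned.asJava)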
     
    #4
  5. Aayush_11

    Aayush_11 New Member

    Joined:
    Feb 22, 2017
    Messages:
    1
    Likes Received:
    0
    I am having the same issue; can you send me the corrected file? aayush.y1994@gmail.com

    Alternatively, you can suggest the changes here.
     
    #5
    Last edited: Jun 23, 2017
  6. _6230

    _6230 Well-Known Member
    Alumni

    Joined:
    Apr 4, 2017
    Messages:
    185
    Likes Received:
    8
    Could you please post the answer here as well?
     
    #6
  7. firozgade

    firozgade Member

    Joined:
    Mar 24, 2016
    Messages:
    2
    Likes Received:
    0
    Kindly forward the new file to firoz.gade@gmail.com; I am also facing the same issue, and the corrected file would be very helpful.
     
    #7
  8. _6296

    _6296 Member
    Alumni

    Joined:
    Apr 6, 2017
    Messages:
    9
    Likes Received:
    0
    I am having the same issue. Can you please explain what correction was done to the file? Please also send me the file: k_soumen@yahoo.com
     
    #8
  9. hasitha.rg

    hasitha.rg Member
    Alumni

    Joined:
    Apr 30, 2015
    Messages:
    2
    Likes Received:
    0

    Hi Megha,

    I've been trying to load the data using the databricks package as well, but it seems that the column "AGE" is malformed in the given dataset.

    The data contains the following (a semicolon inside the data field):
    "age;"
    "58;"

    Is the dataset malformed on purpose? Are we allowed to edit the dataset (in Notepad++)?

    Here are the commands I've used:

    val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter", ";").option("inferSchema","true").option("escape", "\"").load("mkt_campaign.csv")

    scala> val selectedData = df.select("age","marital","balance","y")
    org.apache.spark.sql.AnalysisException: cannot resolve 'age' given input columns
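    As a quick diagnostic (a sketch, using the df from above), printing the schema shows what the reader actually parsed; with the malformed header the column names come back with stray quote characters, which would explain why 'age' cannot be resolved:

    scala> df.printSchema()
    scala> df.columns.foreach(println)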
     
    #9
  10. Amit Himani

    Amit Himani Customer
    Customer

    Joined:
    Apr 5, 2017
    Messages:
    7
    Likes Received:
    1
    My thoughts on a couple of things:
    The input file has this format intentionally, to make it challenging, like production data.
    I have used Spark 2.1 as it is the latest version, but for some reason our cloud lab still has Spark 1.6.

    The code below works for me in Spark 2.1.1; it splits each line on ";" and strips the double quotes.

    // bankText is the raw file loaded as plain text, e.g.:
    // val bankText = sc.textFile("mkt_campaign.csv")

    case class Bank(age: Integer, job: String, marital: String, education: String,
      isdefault: String, balance: Integer, housing: String, loan: String,
      contact: String, month: String, day_of_week: String, duration: Integer,
      campaign: Integer, pdays: Integer, previous: Integer, poutcome: String,
      isSuccess: String)

    // Split each line on ";", drop the quoted header row, and map the
    // fields into Bank, stripping the double quotes from the string columns
    val bankrdd = bankText.map(s => s.split(";"))
      .filter(s => s(0) != "\"age\"")
      .map(s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(4).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt,
        s(6).replaceAll("\"", ""),
        s(7).replaceAll("\"", ""),
        s(8).replaceAll("\"", ""),
        s(9).replaceAll("\"", ""),
        s(10).replaceAll("\"", ""),
        s(11).replaceAll("\"", "").toInt,
        s(12).replaceAll("\"", "").toInt,
        s(13).replaceAll("\"", "").toInt,
        s(14).replaceAll("\"", "").toInt,
        s(15).replaceAll("\"", ""),
        s(16).replaceAll("\"", "")))
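    And, to get back to the original question, a short follow-up sketch (assuming this runs in spark-shell, where the implicits are available):

    import sqlContext.implicits._   // Spark 2.x: import spark.implicits._

    // Convert the RDD of Bank objects to a DataFrame and filter the
    // rows where the campaign outcome (the original "y" column) is "yes"
    val bankDF = bankrdd.toDF()
    val yesDF = bankDF.filter($"isSuccess" === "yes")
    yesDF.show(5)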
     
    #10
  11. Megha_42

    Megha_42 Well-Known Member
    Simplilearn Support

    Joined:
    Dec 15, 2016
    Messages:
    206
    Likes Received:
    9
    Hi Hasitha,

    Your interpretation is right: the dataset is intentionally malformed to keep it closer to a real-world scenario. You are free to change the symbols in the dataset as per your convenience, but you cannot change the data itself.

    Preferably, clean the data using Scala's/Python's own string utilities while you are working in Spark. Also, when loading into a Spark DataFrame, you might need to omit the first row, as it contains the column names.

    Hint:
    Pay close attention to special symbols, orphan quotes, and null values in the file; a clean-up sketch follows.
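    For instance, a minimal sketch of the clean-then-load approach in Spark 1.6 (the file name is an assumption, and this is an illustration rather than the official solution):

    val raw = sc.textFile("mkt_campaign.csv")

    // The first row holds the column names; keep it aside and omit it
    val header = raw.first()
    val data = raw.filter(_ != header)
      .map(_.split(";").map(_.replaceAll("\"", "").trim))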

    Hope this clarifies your doubts.

    All the very best!
     
    #11
