Process 1 Petabyte of Data

Discussion in 'Big Data and Analytics' started by Somashekhar (3400), May 16, 2017.

  1. Somashekhar (3400)

    Somashekhar (3400) Active Member

    Dear Members,

    I am posting a problem statement here. Please provide the steps to be followed to arrive at the solution.
    Problem: I have 1 petabyte (PB) of data, and I have written a Scala program to process it in a Spark environment. My Hadoop cluster consists of 1 Name Node (500 GB hard disk, 32 GB RAM) and 5 Data Nodes (500 GB hard disk, 16 GB RAM each).

    Please explain the steps to process the 1 PB of data using Spark (Scala). I need the steps from the beginning; you may assume anything that is not stated in the problem. The steps should be detailed enough that anybody could carry out the processing.

    Thanks,
    Somashekhar
     
    #1
  2. Megha_42

    Megha_42 Well-Known Member
    Simplilearn Support

    Hi Somashekhar,

    Thank you for reaching out.
    1 petabyte is a huge amount of data! It is equal to about 1,000,000 GB.
    With the Hadoop set-up you have described (5 data nodes with 500 GB each, i.e. roughly 2.5 TB of raw HDFS capacity, and even less once replication is taken into account), it is impossible to store the whole data set statically. You will need additional storage from which you import the data into HDFS one chunk at a time. That by itself is a major task, and it can be done in various ways.
    One option is to chunk the data and load it into HDFS one chunk after another, processing and then evicting each chunk before the next one is loaded. Please note that this staging process has to sit outside the Hadoop set-up, so you will need extra storage and processing capacity dedicated to it.
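    A very rough sketch of that copy-process-delete loop in Scala is below. It assumes the external disks are mounted on an edge node; the paths, chunk names, and the word-count logic are only placeholders for illustration, not a definitive implementation:

        import org.apache.hadoop.fs.{FileSystem, Path}
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, explode, split}

        object ChunkedIngest {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("ChunkedIngest").getOrCreate()
            val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

            // Hypothetical chunk files sitting on external storage mounted on an edge node.
            val externalChunks = Seq("/mnt/external/chunk-0001.txt", "/mnt/external/chunk-0002.txt")
            val stagingDir = new Path("/data/staging")

            externalChunks.foreach { chunk =>
              val target = new Path(stagingDir, new Path(chunk).getName)

              // 1. Copy one chunk from the external disk into HDFS.
              fs.copyFromLocalFile(new Path(chunk), target)

              // 2. Process the chunk with Spark (word count stands in for the real logic).
              spark.read.text(target.toString)
                .select(explode(split(col("value"), "\\s+")).as("word"))
                .groupBy("word").count()
                .write.mode("append").parquet("/data/output/wordcounts")

              // 3. Free the HDFS space before pulling in the next chunk.
              fs.delete(target, false)
            }

            spark.stop()
          }
        }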

    Another way would be to stream the data, delegating the pumping of data into HDFS to an external process.
    Once a chunking or streaming arrangement is in place, you can use Spark SQL (for batch chunks) or Spark Streaming (for streams) to ingest and process the data.
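    As an illustration of the streaming side, here is a minimal sketch using Spark Structured Streaming's file source. It assumes some external process keeps dropping new chunk files into an HDFS landing directory; the paths and the word-count logic are again just placeholders:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, explode, split}

        object StreamingIngest {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("StreamingIngest").getOrCreate()

            // An external process keeps dropping new chunk files into this HDFS directory.
            val lines = spark.readStream.text("/data/landing")

            // The same word-count logic, applied incrementally to every newly arrived file.
            val counts = lines
              .select(explode(split(col("value"), "\\s+")).as("word"))
              .groupBy("word").count()

            val query = counts.writeStream
              .outputMode("complete")
              .format("console")  // swap for a real sink (Parquet, Kafka, ...) in practice
              .option("checkpointLocation", "/data/checkpoints/wordcount")
              .start()

            query.awaitTermination()
          }
        }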
    Could I also ask what logic you had in mind for processing?

    Thanks and Regards
    Megha
     
    #2
  3. Somashekhar (3400)

    Somashekhar (3400) Active Member

    Hi Megha,
    Thanks for the (partial) reply. My aim is to understand how data is ingested into HDFS from the outside world. The processing logic itself is a simple Spark program written in Scala; it could be finding some pattern in the data, or even a simple word count. I have done experiments using small chunks of data, which is obviously small data. When I say I have done something in big data, it is important to know the possible ways to ingest really big data, at petabyte scale.

    When I say it is 1 PB of data, it is definitely already in chunks, with the chunk size depending on the file system it came from. 1 PB means 1,024 TB, and the data files are stored across multiple hard disks. Suppose I have 100 hard disks on which the data files are stored: what is the way to push that data into HDFS? Whether the data file is 100 MB or 1 PB, the processing program is the same, right (as far as I understand)? Even so, in this scenario the data is bigger than what I can place in HDFS storage in one go as a first step, so what are the possibilities for processing it?

    Since there are Big Data experts from Simplilearn participating in the community, I wanted to understand this from you. Please give me your suggestions.
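    For example, a minimal word-count job such as the sketch below (with hypothetical input and output paths) would stay exactly the same whether the input is 100 MB or 1 PB; only the cluster resources and the way the data reaches HDFS change:

        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.functions.{col, explode, split}

        object WordCount {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("WordCount").getOrCreate()

            // The input path may hold 100 MB or many TB; the code does not change,
            // only the number of partitions and tasks scales with the data.
            spark.read.text("/data/input")
              .select(explode(split(col("value"), "\\s+")).as("word"))
              .groupBy("word").count()
              .write.mode("overwrite").parquet("/data/output/wordcounts")

            spark.stop()
          }
        }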
    Thanks,
    Somashekhar
     
    #3
