OPTIMIZATION OF MULTISET DATA ANALYSIS ON HADOOP USING MAP JOIN REDUCE

A PROJECT REPORT
Submitted by

SHENBAGA PRIYA.B
09ITR105

SILAMBARASAN.R
09ITR108

VIGNESWARI.A
09ITR125

in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY IN INFORMATION TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
SCHOOL OF COMMUNICATION AND COMPUTER SCIENCES

KONGU ENGINEERING COLLEGE
(Autonomous)

PERUNDURAI ERODE – 638 052

APRIL 2013

ABSTRACT

Data analysis is the process of inspecting, cleaning, transforming and modeling data with the goal of highlighting useful information, suggesting conclusions and supporting decision making. It is especially significant in cloud computing, where very large volumes of data are processed over large clusters. MapReduce is widely used to handle data in cloud and distributed environments because of its excellent scalability and good fault tolerance. Compared to parallel databases, however, MapReduce is inefficient when it is adopted to perform complex data analysis that joins multiple data sets in order to compute certain aggregates. A system called Map Join Reduce is proposed, which performs such complex data analytical tasks more effectively than the existing system. A filtering-join-aggregation model, an extension of MapReduce's filtering-aggregation programming model, is introduced: filtering logic is first applied to the data sets, the filtered records are joined in a pipelined manner, and the joined output is then grouped to produce the final result. The significance of our proposal is that multiple data sets are joined and aggregated in one go, reducing the frequent checkpointing and shuffling of intermediate results performed by the existing system and thereby improving the efficiency of data processing in distributed applications.
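To make the filtering-join-aggregation idea concrete, the sketch below shows how the same join-plus-aggregate task would look as an ordinary Hadoop reduce-side join. It is only illustrative and is not the Map Join Reduce runtime proposed in the report: the two data sets (customers and orders), their field layouts, the filter predicate on country, and all class names are assumptions made for the example. Each mapper filters and tags its input, the shuffle brings matching keys together, and a single reducer pass joins the two sets and computes the aggregate.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical two-way join of "customers" and "orders" keyed by customer id,
// followed by a per-customer aggregate (total order amount). Names and file
// layouts are illustrative assumptions, not part of the report.
public class FilterJoinAggregateSketch {

    // Filtering phase: each mapper applies its filter and tags records with the
    // source data set so the reducer can tell them apart after the shuffle.
    public static class CustomerMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");      // custId,name,country
            if (f.length == 3 && "IN".equals(f[2])) {       // example filter predicate
                ctx.write(new Text(f[0]), new Text("C\t" + f[1]));
            }
        }
    }

    public static class OrderMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");      // orderId,custId,amount
            if (f.length == 3) {
                ctx.write(new Text(f[1]), new Text("O\t" + f[2]));
            }
        }
    }

    // Join + aggregation phase: one reducer call sees every record for a customer
    // id, joins the two sets, and emits the aggregate in the same pass.
    public static class JoinAggregateReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String name = null;
            double total = 0.0;
            for (Text v : values) {
                String[] f = v.toString().split("\t", 2);
                if ("C".equals(f[0])) {
                    name = f[1];
                } else {
                    total += Double.parseDouble(f[1]);
                }
            }
            if (name != null) {                             // inner join: drop unmatched orders
                ctx.write(key, new Text(name + "\t" + total));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-join-aggregate sketch");
        job.setJarByClass(FilterJoinAggregateSketch.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CustomerMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, OrderMapper.class);
        job.setReducerClass(JoinAggregateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Joining more than two data sets this way requires chaining several such jobs, each of which checkpoints and shuffles its intermediate results; the one-pass filtering-join-aggregation model described above is intended to avoid exactly that overhead.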

INTRODUCTION

In information technology, big data refers to collections of data sets so large and complex that they become difficult to process using …



References:

1. Afrati, F.N. and Ullman, J.D. (2010) 'Optimizing Joins in a Map-Reduce Environment', Proc. 13th Int'l Conf. Extending Database Technology (EDBT '10).
2. Chuck Lam (2010) 'Hadoop in Action', Manning Publications.
3. Dawei Jiang, Anthony K.H. Tung, and Gang Chen (2011) 'Map-Join-Reduce: Toward Scalable and Efficient Data Analysis on Large Clusters', IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 9.
4. Dean, J. and Ghemawat, S. (2004) 'MapReduce: Simplified Data Processing on Large Clusters', Proc. Operating Systems Design and Implementation (OSDI), pp. 137-150.
5. Yang, H.C., Dasdan, A., Hsiao, R.L., and Parker, D.S. (2007) 'Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters', Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '07).
