BigBench in Hadoop Ecosystem

A BigBench Implementation on the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org

Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the design choices we made and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a perfect example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMS). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this



