BigBench in Hadoop Ecosystem

A BigBench Implementation on the Hadoop Ecosystem
Badrul Chowdhury, Tilmann Rabl, and Hans-Arno Jacobsen
Middleware Systems Research Group
University of Toronto
badrul.chowdhury@mail.utoronto.ca, tilmann.rabl@utoronto.ca, jacobsen@eecg.toronto.edu
http://msrg.org

Abstract. BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test other systems, the queries have to be translated.
In this paper, an alternative implementation of BigBench for the Hadoop ecosystem is presented. All 30 queries of BigBench were realized with Apache Hive, Apache Hadoop, Apache Mahout, and NLTK. We present the different design choices we made and show a performance evaluation.

1 Introduction

Big data analytics is an ever-growing field of research and business. Due to the drastic decrease in the cost of storage and computation, more and more data sources become profitable for data mining. Online stores are a perfect example: while earlier online shopping systems would only record successful transactions, modern systems record every single interaction of a user with the website. The former allowed for simple basket analysis techniques, while the current level of detail in monitoring makes detailed user modeling possible.
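To make the notion of basket analysis concrete, the following is a minimal sketch, not part of the BigBench implementation, that counts how often pairs of items appear together in recorded transactions. The function name `pair_counts` and the sample baskets are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items is bought together.

    `baskets` is an iterable of item lists, one list per successful
    transaction (hypothetical input format for this sketch).
    """
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        items = sorted(set(basket))
        counts.update(combinations(items, 2))
    return counts


# Hypothetical transaction log: only completed purchases are recorded.
baskets = [
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "pen"],
]

counts = pair_counts(baskets)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

This kind of analysis needs only the transaction records. User modeling, by contrast, would draw on the full interaction log, which is far larger and less structured.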
The growing demands on data management systems and the new forms of analysis have led to the development of a new breed of systems, big data management systems (BDMS). Similar to the advent of database management systems, there is a rapidly growing ecosystem of diverse approaches. This leads to a dilemma for customers of BDMSs, since there are no realistic and proven measures to compare different offerings. To this

