Cloud and Big Data Day 2013 - speakers and abstracts

Keynotes

See program.

Information Management in the Cloud - Big Data Analytics Beyond Map/Reduce

Volker Markl, Technische Universität Berlin (TU Berlin)

The talk will present cloud information management and big data analytics. We will survey drivers and highlight current trends in big data management. After surveying big data analytics, with its challenges and opportunities, we will present a new flavor of data processor that goes beyond the popular map/reduce paradigm. We propose a programming model based on second-order functions that describe what we call parallelization contracts (PACTs). PACTs are a generalization of the map/reduce programming model, extending it with additional higher-order functions and output contracts that give guarantees about the behaviour of a function. A PACT program is transformed into a data flow for a massively parallel execution engine, which executes its sequential building blocks in parallel and provides communication, synchronization and fault tolerance. The concept of PACTs allows the system to abstract parallelization from the specification of the data flow and thus enables several types of optimizations on the data flow. The system as a whole is as generic as map/reduce systems, but can provide higher performance through optimization and adaptation of the system to changes in the execution environment. Moreover, it enables the execution of tasks that traditional map/reduce systems cannot execute without mixing data flow program specification and parallelization, such as joins, time-series analysis or data mining operations.
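
As a rough illustration of the idea of second-order functions in the abstract above, the sketch below models three contracts as plain Python functions that take a user-defined function and decide how records are grouped before it is called. The names map_contract, reduce_contract and match_contract and the sample data are hypothetical and chosen for this example only; this is not the API of the system presented in the talk, and it runs sequentially rather than in parallel.

```python
from collections import defaultdict

# Hypothetical sketch: each "contract" is a second-order function. It takes
# a user-defined function (udf) and decides how records are grouped before
# the udf is called. A parallel engine could use that grouping rule to
# decide how to partition the data; here everything runs sequentially.

def map_contract(udf, records):
    """Call udf once per (key, value) record, independently of all others."""
    out = []
    for key, value in records:
        out.extend(udf(key, value))
    return out

def reduce_contract(udf, records):
    """Group records by key and call udf once per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(udf(key, values))
    return out

def match_contract(udf, left, right):
    """Call udf once per pair of left/right records with equal keys (a join)."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    out = []
    for key, left_value in left:
        for right_value in index.get(key, []):
            out.extend(udf(key, left_value, right_value))
    return out

# Made-up sample data: (customer_id, amount) and (customer_id, country).
orders = [(1, 30.0), (2, 12.5), (1, 7.5)]
customers = [(1, "SE"), (2, "DE")]

# The join is expressed as a contract, not hand-coded inside a map or reduce.
by_country = match_contract(
    lambda cid, amount, country: [(country, amount)], orders, customers
)
totals = reduce_contract(
    lambda country, amounts: [(country, sum(amounts))], by_country
)
```

The user functions stay sequential and unaware of partitioning; only the contract states which records must be seen together, which is the separation of data flow specification from parallelization that the abstract describes.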

Volker Markl

Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin). Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA.

 

ConPaaS: a Platform as a Service for Multi-clouds

Guillaume Pierre, IRISA / Université de Rennes

Guillaume Pierre is a Professor in Computer Science at the University of Rennes, France. Prior to this he spent 13 years at the VU University Amsterdam. His main interests are Cloud computing, Web application support, peer-to-peer and many other types of large-scale distributed systems. He took part in several European and EIT ICT Labs projects and acted as the lead designer of the ConPaaS platform-as-a-service environment. Pierre holds a PhD in Computer Science from the University of Evry-val d’Essonne, France. He is also the academic coordinator of the EIT ICT Labs Master School at the University of Rennes.

 

Making Big Data Analytics Interactive and Realtime

Matei Zaharia, UC Berkeley

The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early frameworks, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown. Users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning and graph processing), and (3) real-time processing on large data streams. In this talk, we present a single abstraction, resilient distributed datasets (RDDs), that supports all of these emerging workloads by providing efficient and fault-tolerant in-memory data sharing. We have used RDDs to build a stack of computing systems including the Spark parallel engine, Shark SQL processor, and Spark Streaming engine. Spark and Shark can run machine learning algorithms and interactive queries up to 100x faster than Hadoop MapReduce, while Spark Streaming enables fault-tolerant stream processing at significantly higher scales than were possible before. These systems have been used in multiple industry and research applications, and have a growing open source community with over 15 companies contributing.
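
To make the idea of fault-tolerant in-memory data sharing more concrete, here is a minimal sketch of a dataset that remembers how each partition was derived, so that a lost in-memory partition can be recomputed instead of being restored from a replica. ToyRDD and its methods are hypothetical names invented for this example; this is not the Spark API, and the sketch runs in a single process.

```python
class ToyRDD:
    """Hypothetical, single-process sketch of a lineage-tracking dataset."""

    def __init__(self, compute, num_partitions):
        self._compute = compute          # partition index -> list of records
        self.num_partitions = num_partitions
        self._persist = False
        self._cache = {}                 # in-memory copies of partitions

    @classmethod
    def from_partitions(cls, partitions):
        # Stands in for input that lives in reliable storage.
        data = [list(p) for p in partitions]
        return cls(lambda i: list(data[i]), len(data))

    def map(self, fn):
        # The child keeps a reference to its parent: that is the lineage.
        return ToyRDD(lambda i: [fn(x) for x in self.partition(i)],
                      self.num_partitions)

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self.partition(i) if pred(x)],
                      self.num_partitions)

    def persist(self):
        """Ask for computed partitions to be kept in memory for reuse."""
        self._persist = True
        return self

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        result = self._compute(i)        # rebuild from lineage if missing
        if self._persist:
            self._cache[i] = result
        return result

    def drop_partition(self, i):
        """Simulate losing the in-memory copy of one partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


base = ToyRDD.from_partitions([[1, 2, 3], [4, 5, 6]])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0).persist()

first = even_squares.collect()     # computes and caches both partitions
even_squares.drop_partition(1)     # simulated failure
second = even_squares.collect()    # partition 1 is recomputed from lineage
```

Only the lost partition is recomputed in this sketch; the partition that is still cached is served from memory, which is what makes repeated interactive or multi-pass use cheap.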

Matei Zaharia

Matei Zaharia is finishing his PhD at UC Berkeley, where he worked with Scott Shenker and Ion Stoica on topics in large-scale data processing and cloud computing. After Berkeley, he will be starting an assistant professor position at MIT. During his PhD, Matei has also been an active open source contributor, becoming a committer on the Apache Hadoop project and starting the Mesos and Spark projects.

 

Where BigData makes a difference - real life client use cases, and the next big trend in the BigData field

David Rådberg and Livio Ventura, IBM

Livio Ventura gives us an insight into a few selected client engagements where BigData has made a crucial difference to the organizations we work with. He will also give us a greater understanding of how data warehousing, Hadoop, stream computing, data exploration and BI can fit together in order to maximize a BigData initiative.
Finally, he will give us a glimpse of what IBM now sees as the next big trend in the field.

Bios

David Rådberg leads the Swedish client engagements around BigData Analytics. He has been with IBM for five years and has a background in data storage. Before joining IBM he fulfilled his childhood dreams by working for a premier league Swedish football team.

Livio Ventura is the Big Data Industry Leader for Telco, Energy and Utilities and Digital Media in Europe at IBM. In this capacity he is responsible for accelerating the market understanding of IBM’s Big Data solutions and their value proposition.

 

Applications Session

Big Cellular Network Data

Olof Görnerup, SICS Swedish ICT

Today essentially everybody uses mobile devices in cellular networks, which generate huge volumes of data. This data can be refined into collective patterns that, in turn, enable a multitude of innovative applications. In this presentation I will discuss both the challenges and the potential of utilizing cellular network data, and present two examples of work done at SICS in this area that are relevant to, for example, urban planning and crisis management.

Olof Görnerup

Olof Görnerup is a senior researcher at SICS. He holds a PhD in the interdisciplinary field of complex systems and has a research background in data and systems analysis and modeling, with a focus on data mining, machine learning and management of self-organizing systems.

 

Grid Computing for Big Data Challenges at the Large Hadron Collider

Mattias Ellert, eSSENCE and Uppsala University

To meet the big data challenges in experiments at the Large Hadron Collider (LHC), the European Organization for Nuclear Research (CERN) developed Grid computing. Because of the unique international position of CERN, the organisation of massive, worldwide distributed computing has been successfully implemented with about 250,000 cores, 160 PB of disk storage and 90 PB of tape storage distributed over 150 sites around the world. The Grid was one of the key infrastructures leading to the discovery of the Higgs boson in July 2012. We will present the computing model for the ATLAS experiment at the LHC and the development of the Advanced Resource Connector (ARC) middleware that integrates computing resources (usually computing clusters managed by a batch system) and storage facilities, making them available via a secure common Grid layer. The ARC middleware, which is jointly developed by the Nordic countries and others, allows for Grid computing on heterogeneous clusters and is deployed in European Grid tiers that use distributed resources.

Mattias Ellert

Mattias Ellert works at the Department of Physics and Astronomy, Division of High Energy Physics, at Uppsala University. One of his areas of expertise is application development in Grid computing.

Extracting Value from Petabytes of Data at Spotify

Fabian Alenius, Product Owner of the Analytical Computation Squad at Spotify

Spotify brings you the right music for every moment. With over 24 million users and 20 million songs, Spotify has to deal with vast amounts of data every day. Over the last few years Spotify has seen a small Hadoop cluster for reporting grow into a platform that feeds the entire organization with data.

Fabian Alenius

Fabian Alenius is a product owner at Spotify working with data infrastructure and has seen the transition first hand.

 

Platforms Session

Building a Web Intelligence Machine

Staffan Truvé, CTO of Recorded Future

Recorded Future automatically analyzes millions of documents from the web every day to help users analyze past, current, and future events. In this talk, we describe the overall system and its implementation, including challenges in linguistic analysis, database scalability, and information visualization. We will discuss how our cloud-based system allows for continuous deployment and elastic scalability, and how we combine the use of algorithms and people to create really good analytic results.

 

Hop: Hadoop Open Platform-as-a-Service

Jim Dowling, SICS Swedish ICT and KTH

Hadoop is currently the de facto open-source standard for managing and processing big data. In Hop, we present an implementation of version 2 of Hadoop that supports both customizable metadata for HDFS and open platform-as-a-service support for both public and private clouds (AWS, OpenStack). Our system supports an alternative HA model for Hadoop's NameNode, where metadata is persisted to MySQL Cluster, resulting in higher performance and scalability. However, the main win in our architecture is support for user-defined metadata for HDFS. We demonstrate its utility in providing block-level indexing for the analysis of genomic data.

Jim Dowling

Dr. Jim Dowling is an Associate Professor at KTH - Royal Institute of Technology, and a senior researcher at SICS. His research background is distributed systems, and he previously worked at MySQL AB. He is coordinator of the BiobankCloud FP7 project that is providing support for the storage and analysis of big genomic data.

 

Polyglot Persistence in the Cloud

Vinay Joosery, Severalnines

The relational database is no longer the default choice for applications. Welcome to the world of polyglot persistence, which is about using multiple data storage mechanisms to cater for the different needs of an application. Using SQL and NoSQL databases together is becoming very common, especially in cloud-based services. However, this flexibility also opens up a whole new set of challenges in terms of programming, deployment and operational complexity.

In this talk, we will take a look at the new database landscape, and see a demo of how multiple databases in hybrid cloud environments could be managed in the real world.

Vinay Joosery

Vinay Joosery is a passionate advocate and builder of concepts and businesses around Big Data computing infrastructures. Before co-founding Severalnines, he was VP EMEA at Pentaho, the Open Source BI leader. Prior to Pentaho, Vinay was at MySQL AB where he headed the Global Telecoms unit, and built the business around MySQL's High Availability and Clustering product lines. Before MySQL, Vinay served as Director of Sales and Marketing for Ericsson Alzato, an Ericsson-owned venture focused on parallel main-memory databases.