Workshop on Data Analytics in the Cloud
Program
Session 1: 9:00-10:30
- 9:00-10:00 Keynote 1: Cloud Data Analytics: Challenges and Opportunities. Roger Barga (Microsoft Research).
Cloud computing has the
potential to transform how organizations use computing power to create
a collaborative foundation of shared analytics, mission-centric data,
and IT management. Challenges to the implementation of cloud computing
remain, but the new analytic capabilities of big data, ad-hoc
analysis, and massively scalable analytics offer an important
approach to achieving lasting strategic advantage. The purpose of this
talk is to discuss major macro trends that analytics practitioners and
researchers need to be cognizant of, along with details on micro
trends that correlate with the macro trends, and to present
frameworks, best practices, and real world examples.
- 10:00-10:30 Cost Models for View Materialization in the Cloud. Thi-Van-Anh Nguyen (CNRS, UMR 6072, GREYC, Université de Caen Basse), Laurent d’Orazio (CNRS, UMR 6158, LIMOS, Université Blaise Pascal), Sandro Bimonte (Cemagref, UR TSCF), Jérôme Darmont (ERIC Lyon 2, Université de Lyon).
In classical databases,
query performance is typically achieved through physical data
structures such as caches, indexes and materialized views. In this
context, many cost models help select a "best set" of such data
structures. However, this selection task becomes more complex in the
cloud. The criterion to optimize is indeed at least two-dimensional,
with the monetary cost of using the cloud balancing query response
time. Thus, we define in this paper new cost models that fit into the
pay-as-you-go paradigm of cloud computing. These cost models help
achieve a multi-criteria optimization of the view materialization
vs. CPU power augmentation problem, under budget constraints. Finally,
we present experimental results that provide a first validation of our
contribution and show that cloud view materialization is always
desirable.
Session 2: 11:00-12:30
- 11:00-11:30 Stormy: An Elastic and Highly Available Streaming Service in the Cloud. Simon Loesing (ETH Zurich), Martin Hentschel (ETH Zurich), Tim Kraska (UC Berkeley), Donald Kossmann (ETH Zurich)
In recent years, new highly scalable storage systems have
significantly contributed to the success of Cloud Computing. Systems
like Dynamo or Bigtable have demonstrated their ability to handle
tremendous amounts of data and scale to a very large number of
nodes. Although these systems are designed to store data, the
fundamental architectural properties and the techniques used (e.g.,
request routing, replication and load-balancing) can also be applied
to data streaming systems. In this paper, we present Stormy, a
distributed stream processing service for continuous data
processing. Stormy is based on proven techniques from existing Cloud
storage systems that are adapted to efficiently execute streaming
workloads. The primary design focus lies in providing a scalable,
elastic and fault-tolerant framework for continuous data processing
while at the same time optimizing resource utilization and increasing
cost-efficiency. Stormy is able to process any kind of stream-based
workload, thus covering a wide range of application use cases, from
real-time data analytics to long-term data aggregation jobs.
- 11:30-12:00 RDF Data Management in the Amazon Cloud. Francesca Bugiotti (Universita Roma Tre), Francois Goasdoue (INRIA Saclay & Universite Paris-Sud 11), Zoi Kaoudi (INRIA Saclay & Universite Paris-Sud 11), Ioana Manolescu (INRIA Saclay & Universite Paris-Sud 11)
Cloud computing has been massively adopted recently in many
applications for its elastic scaling and fault tolerance. At the same time, given that the number of available RDF
data sources on the Web increases rapidly, there is a constant need
for scalable RDF data management tools. In this paper we propose a novel
architecture for the distributed management of RDF data, exploiting an existing commercial cloud infrastructure, namely Amazon Web Services (AWS). We study the problem of indexing RDF data stored within AWS, by using SimpleDB, a key-value store provided by AWS for small data items. The goal of the index is to efficiently identify the RDF datasets which may have answers for a given query, and route the query only to those. We devised and experimented with several indexing strategies; we discuss experimental results and avenues for future work.
- 12:00-12:30 FunSQL: It is time to make SQL functional. Carsten Binnig (DHBW Mannheim), Franz Faerber (SAP), Robin Rehrmann (DHBW Mosbach), Rudolf Riewe (DHBW Mosbach)
With the rise of cloud computing and cloud-scale data management, the importance of shipping the code of an application to its data has increased tremendously. Especially when offering data analytics on top of traditional relational databases as a service in the cloud, new data-centric programming paradigms become necessary. Traditionally, relational databases offer two approaches to ship code close to the data: declarative SQL statements and imperative stored procedures. While SQL statements can be efficiently optimized and parallelized, stored procedures allow more complex logic that can be efficiently decomposed.
In this paper, we propose FunSQL, a novel functional language that extends SQL. FunSQL combines the best of both worlds: (1) it allows application developers to implement more complex application logic than in SQL alone, (2) the application logic can be decomposed efficiently, and (3) it can be efficiently optimized and parallelized.
Session 3: 14:00-15:30
- 14:00-15:00 Keynote 2: Big Data Analytics meets Query Optimization. Florian Waas (EMC/Greenplum)
Query optimization is known as the 'undertaker' of technology hypes in the data management arena: technologies come and go in cycles with certain patterns and query optimization is usually the last stage in a technology's progression. Once research has converged on query optimization for it, the new technology (or some part of it) is either accepted into mainstream or simply dies. Over the past decades the database community has seen numerous examples of this cycle including object-oriented databases, stream databases, XML databases, and many more.
Big Data analytics and cloud computing have become popular subjects with researchers and practitioners and a variety of platforms or methodologies have been touted as essential for Big Data. Query optimization starts to play an increasingly important role as latency considerations have come to dominate many application scenarios and more and more sophisticated techniques are required to access and process the vast amounts of data.
In this presentation, we assess the hype cycle for Big Data/Cloud technologies, look at challenges for query optimization, and try to answer the question whether big data/cloud computing and large scale analytics are converging toward mainstream data management any time soon.
- 15:00-15:30 Challenges and Approaches for Distributed Workflow-Driven Analysis of Large-Scale Biological Data. Ilkay Altintas (San Diego Supercomputer Center, University of California, San Diego), Jianwu Wang (San Diego Supercomputer Center, University of California, San Diego), Daniel Crawl (San Diego Supercomputer Center, University of California, San Diego), Weizhong Li (Center for Research in Biological Systems, University of California, San Diego)
Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called bioKepler, that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.
Session 4: 16:00-17:30
- Panel Discussion. Roger Barga (Microsoft Research), Michael J. Carey (UC Irvine), Goetz Graefe (HP Labs), Jeffrey D. Ullman (Stanford University), Florian Waas (EMC/Greenplum), Kostas Tzoumas (TU Berlin, moderator)
Workshop theme
Due to unprecedented data growth, a need for rich data analysis on
petabyte-scale data is emerging. To support such analysis, new
data-centric programming paradigms and data management systems are
being established, popularized by Google's MapReduce framework and its open-source implementation, Hadoop. The new data analysis market raises
several research challenges spanning the whole system stack, from
storage and network technologies to programming languages.
At the same time, cloud computing is emerging as a cost-effective
paradigm for massively scalable, fault-tolerant, and adaptive
computation. Cloud computing architectures scale to massive numbers of
commodity computers and adapt to changing hardware availability and
requirements by dynamically allocating virtualized computing
nodes. Cloud computing systems often use a computational model
motivated by functional programming, abstracting away the internals of
computation. Both enterprise and client data are moving to the cloud
for reasons of cost, reliability, and manageability. This
migration poses significant challenges for current systems. The economies
of scale offered by cloud computing provide opportunities for richer
data analysis on even larger data sets. The new data analysis
applications and the unprecedented scale are not adequately served by
current offerings, including commercial DBMSs and analytics systems,
and open-source cloud computing systems.
This workshop will provide a perfect forum to bring together
researchers and practitioners interested in big data analytics, cloud
computing, and their intersection. The workshop will help to foster
future collaborations and the formation of a community that lays the
groundwork for this emerging field.
Topics of Interest
Submissions of original research contributions are invited for all
relevant topics, including, but not limited to:
- Analytic frameworks for cloud systems
- Data models and query languages
- Parallel query processing and optimization
- Scalable storage and indexing
- Workload management
- Data privacy and security
- Administration and manageability
- Benchmarking, tuning, and testing
- Energy management
- Industrial experience and use cases
- Data science and analytics technologies
- Scientific data management
- Scalable machine learning
Important Dates
Submission deadline (extended):
December 14, 2011
Notification to authors:
January 15, 2012
Camera ready papers due:
February 2, 2012
Workshop:
March 30, 2012
Submission Guidelines
All papers should be formatted using the double-column ACM format
(templates available at:
http://www.acm.org/sigs/publications/proceedings-templates). The
workshop solicits:
- Regular Research Papers (maximum length: 12 pages)
- Vision Papers (maximum length: 6 pages)
- Experience Reports (maximum length: 6 pages)
Papers should be submitted using the
conference management system. The workshop proceedings will be
published in the ACM Digital Library.
Submission site: https://cmt.research.microsoft.com/DANAC2012/
People
PC chairs
Tim Kraska, UC Berkeley, USA
Kostas Tzoumas, TU Berlin, Germany
Steering Committee
Michael J. Carey, UC Irvine, USA
Volker Markl, TU Berlin, Germany
Program Committee
Shivnath Babu, Duke University, USA
Magdalena Balazinska, University of Washington, USA
Alexandru Iosup, TU Delft, The Netherlands
Donald Kossmann, ETH Zurich, Switzerland
Sam Madden, MIT, USA
Ioana Manolescu, INRIA, France
Jignesh Patel, University of Wisconsin-Madison, USA
Christopher Re, University of Wisconsin-Madison, USA
Mirek Riedewald, Northeastern University, USA
Marcos Vaz Salles, University of Copenhagen, Denmark
Florian Waas, EMC/Greenplum, USA
Keynotes
Keynote 1: Roger Barga, Microsoft Research, USA
Keynote 2: Florian Waas, EMC/Greenplum, USA