DanaC: Workshop on Data Analytics in the Cloud
Data analytics has the potential to transform both scientific research and data-driven business decisions. Effective analysis of huge volumes of data can shift scientific research from hypothesis-driven to data-driven, where the formation of scientific hypotheses is aided by patterns discovered in vast quantities of data. For most technology companies that operate at Web scale, analyzing customer data can provide insights into customer behavior and inform critical business decisions.
Cloud computing has emerged as a cost-effective and elastic computing paradigm.
Cloud infrastructures scale to massive numbers of commodity computing nodes and
provide adaptive provisioning without prohibitive initial investments.
Data analytics has the potential to be a significant cloud application, and to constitute a large
fraction of the workload of modern data centers.
Designing the infrastructures and systems for data
management in the new computing environments remains an open challenge.
Topics of Interest
Areas of particular interest for the workshop include (but are not limited to):
- Parallel execution and optimization
- Scalable storage and indexing
- Workload management
- Infrastructures for cloud computing
- Scalable machine learning
- Frameworks for parallel computing
- Industrial experiences and use cases
- Benchmarking, tuning, and testing
- Data science and analytics
- Privacy and security in the cloud
- Economic models for data
- Data management and analytics as a service
Find the workshop proceedings here.
9:00 - 10:30 (Session 1)
- Opening Remarks
- Keynote by Jingren Zhou: "Web-scale Analytics: Parallel Databases Meet MapReduce"
Massive data analysis in cloud-scale data centers plays a crucial role in making critical business decisions nowadays. In this talk, I describe a cloud-scale distributed computation system, called SCOPE, targeted for massive data analysis over tens of thousands of machines at Microsoft Bing. SCOPE combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph (DAG) of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. SCOPE is being used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
- "Gong show": 2-minute teaser of all papers accepted at the workshop (in the same order as they appear in the program)
10:30 - 10:45
10:45 - 12:15 (Session 2)
ScyPer: Elastic OLAP Throughput on Transactional Data. Tobias Mühlbauer (TUM), Wolf Rödiger (TUM), Angelika Reiser (TUM), Alfons Kemper (TUM), Thomas Neumann (TUM).
Scalable I/O-Bound Parallel Incremental Gradient Descent for Big Data Analytics in GLADE. Chengjie Qin (UC Merced), Florin Rusu (UC Merced).
Towards a Workload for Evolutionary Analytics. Jeff LeFevre (UC Santa Cruz), Jagan Sankaranarayanan (NEC Labs America), Hakan Hacigumus (NEC Labs America), Junichi Tatemura (NEC Labs America), Neoklis Polyzotis (UC Santa Cruz).
Don't Match Twice: Redundancy-free Similarity Computation with MapReduce. Lars Kolb (U. Leipzig), Andreas Thor (U. Leipzig), Erhard Rahm (U. Leipzig).
12:15 - 13:45
13:45 - 15:15 (Session 3)
A Vision for Personalized Service Level Agreements in the Cloud. Jennifer Ortiz (U. Washington), Victor de Almeida (U. Washington), Magdalena Balazinska (U. Washington).
Multi-objective optimization of data flows in a multi-cloud environment. Efthymia Tsamoura (AUTH), Anastasios Gounaris (AUTH), Kostas Tsichlas (AUTH).
GPText: Greenplum Parallel Statistical Text Analysis Framework. Kun Li (U. Florida), Christan Grant (U. Florida), Daisy Zhe Wang (U. Florida), Sunny Khatri (EMC), George Chitouras (EMC).
Enabling Secure Query Processing in the Cloud using Fully Homomorphic Encryption. Murali Mani (University of Michigan, Flint).
A Case For Dynamic Memory Partitioning in Data Centers. Daniel Warneke (ICSI Berkeley), Christof Leng (ICSI Berkeley).
15:15 - 15:30
15:30 - 17:00 (Session 4)
Panel discussion: "What will be the 'SQL' of 'Big Data NoSQL' systems?"
Daniel Abadi (Yale), Shivnath Babu (Duke), Fatma Ozcan (IBM Almaden), Jeffrey Ullman (Stanford), Till Westmann (Oracle), Jingren Zhou (Microsoft)
Moderator: Volker Markl (TU Berlin)
Big data analytics has given rise to a new class of data management systems, e.g., GraphLab, Spark, MapReduce (Hadoop), Asterix, Stratosphere, and others. These systems have introduced novel query or data analysis languages, all of which aim to support data analysis applications that go beyond selection, aggregation, or relational queries, most notably enabling machine learning algorithms, graph mining, text mining, or mathematical optimization. We currently see a confusion of Babylonian proportions with respect to data programming languages, with a lack of agreement on a common model and query processing language. In particular, some parts of the community seem to be running in circles, with some protagonists of the NoSQL movement implementing subsets of SQL or XQuery on top of Hadoop (e.g., Pig, Hive, JAQL). However, a standardized language could be a key factor for market growth and future mainstream success of these systems beyond niche solutions.
17:00 - ...
Social event at the Long Room (very close to the conference hotel)
- Lars Kolb (U. Leipzig), Andreas Thor (U. Leipzig), Erhard Rahm (U. Leipzig). Don't Match Twice: Redundancy-free Similarity Computation with MapReduce.
- Efthymia Tsamoura (AUTH), Anastasios Gounaris (AUTH), Kostas Tsichlas (AUTH). Multi-objective optimization of data flows in a multi-cloud environment.
- Tobias Mühlbauer (TUM), Wolf Rödiger (TUM), Angelika Reiser (TUM), Alfons Kemper (TUM), Thomas Neumann (TUM). ScyPer: Elastic OLAP Throughput on Transactional Data.
- Chengjie Qin (UC Merced), Florin Rusu (UC Merced). Scalable I/O-Bound Parallel Incremental Gradient Descent for Big Data Analytics in GLADE.
- Jennifer Ortiz (U. Washington), Victor de Almeida (U. Washington), Magdalena Balazinska (U. Washington). A Vision for Personalized Service Level Agreements in the Cloud.
- Jeff LeFevre (UC Santa Cruz), Jagan Sankaranarayanan (NEC Labs America), Hakan Hacigumus (NEC Labs America), Junichi Tatemura (NEC Labs America), Neoklis Polyzotis (UC Santa Cruz). Towards a Workload for Evolutionary Analytics.
- Kun Li (U. Florida), Christan Grant (U. Florida), Daisy Zhe Wang (U. Florida), Sunny Khatri (EMC), George Chitouras (EMC). GPText: Greenplum Parallel Statistical Text Analysis Framework.
- Murali Mani (University of Michigan, Flint). Enabling Secure Query Processing in the Cloud using Fully Homomorphic Encryption.
- Daniel Warneke (ICSI Berkeley), Christof Leng (ICSI Berkeley). A Case For Dynamic Memory Partitioning in Data Centers.
All papers should be submitted in pdf and formatted using the double-column ACM format (templates are available here).
The workshop solicits:
- research papers (maximum length: 5 pages)
- vision papers (maximum length: 5 pages)
- industrial experience reports (maximum length: 5 pages)
All papers should clearly mark their type (research/vision/industrial) in the paper title.
Papers should be submitted using the conference management system: https://cmt.research.microsoft.com/DANAC2013
Notification of acceptance: April 26, 2013
Final papers due:
Workshop: June 23, 2013
Please remove the research/vision/industry qualifier from your paper title (unless it is part of the title itself).
Length: All submitted papers must be formatted according to the instructions below, and must be no more than 5 pages in length. This page limit includes all parts of the paper: title, abstract, body, bibliography, and appendices.
File type: Each paper is to be submitted as a single PDF file, formatted for 8.5" x 11" paper and no more than 5 MB in file size. (Larger files will be rejected by the submission site.)
Formatting: Papers must follow the ACM Proceedings Format, using one of the templates provided here for Word and LaTeX (version 2e). (For LaTeX, both Option 1 and Option 2 are acceptable.) The font size, margins, inter-column spacing, and line spacing in the templates must be kept unchanged.
Authors should apply ACM Computing Classification categories and terms. The templates provide space for this indexing and point authors to the Computing Classification Scheme.
The camera-ready (CR) version must also include a copyright statement at the bottom of the first page, left column. ACM will contact authors to complete a rights management form and will subsequently provide the appropriate statement. Please contact the chairs of the workshop if you do not hear from ACM about the rights management form.
All fonts MUST be embedded within the PDF file. Any PDF that
is not deposited with fonts embedded will need to be corrected. To
help you through this process, ACM has guidance on how to embed
your fonts. Please download the ACM Digital Library optimal
distiller settings file, ACM.joboptions. ACM cannot substitute font
types, though; this must be done in the source files before the
PostScript or PDF is generated. If bit-mapped fonts are used, they
will not necessarily display legibly in all PDF readers on all
platforms, though they will print out fine.
The camera-ready version (in PDF) should be submitted online through
the paper submission site.
Michael J. Carey
Michael Armbrust (Google, USA)
Yanpei Chen (Cloudera, USA)
Vuk Ercegovac (IBM Almaden, USA)
Shenoda Guirguis (Intel, USA)
Hakan Hacigumus (NEC Labs, USA)
Donald Kossmann (ETH Zurich, Switzerland)
Jignesh Patel (University of Wisconsin – Madison, USA)
Christopher Re (University of Wisconsin – Madison, USA)
Russell Sears (Microsoft, USA)
Ion Stoica (UC Berkeley, USA)
Philipp Unterbrunner (Oracle Labs, USA)
Florian Waas (EMC, USA)
DanaC 2012: www.danac.org/2012/