DanaC: Workshop on Data analytics at sCale
DanaC was held in conjunction with SIGMOD/PODS 2015 in Melbourne, Victoria, Australia. You can find the proceedings in the ACM Digital Library.
DanaC 2015 Programme Overview
The workshop will take place on Sunday 31st May, and will feature:
- An industrial keynote by Carlo Curino (Microsoft) titled "BigData: running punctually at data-center scale and beyond"
- An academic keynote by Chris Re (Stanford University) titled "DeepDive and its applications in science and the fight against human trafficking"
- 4 research paper presentations
- 2 invited presentations of scalable machine learning systems, namely SystemML, and Apache SINGA
Detailed Schedule (draft)
08:45 - 10:00 Session 1: Industrial Keynote Speech & "Gong show"
- 08:45 - 09:00 Welcome
- 09:00 - 10:00 Industrial Keynote by Carlo Curino (Microsoft Research)
Title: "BigData: running punctually at data-center scale and beyond"
Abstract: In this keynote, I discuss some of the BigData research ongoing in my group at Microsoft: the Cloud and Information Services Lab (CISL). After introducing the general paradigm of a Cluster-OS (the separation of application-level concerns from general-purpose resource management), I focus on the Apache Hadoop / YARN incarnation of this paradigm, and touch on a few areas of active research and development. Time permitting, I will: 1) discuss key technologies that provide explicit time-based SLAs (deadlines) for production workloads on BigData clusters, 2) present motivation and ongoing work that enables us to scale Hadoop to data-center scale, and 3) present an approach for bandwidth-conscious and sovereignty-aware BigData analytics across geographically distributed data-centers.
Bio: Carlo Curino received a PhD from Politecnico di Milano, and spent two years as a Postdoctoral Associate at MIT CSAIL, leading the relational cloud project. He worked at Yahoo! Research as a Research Scientist, focusing on mobile/cloud platforms and entity deduplication at scale. Carlo is currently a Senior Scientist at Microsoft in the Cloud and Information Services Lab (CISL), where he works on big-data platforms and cloud computing. Carlo is an active committer in the Apache Hadoop project.
10:00 - 10:30 Coffee break
10:30 - 12:00 Session 2: Research Paper Presentations
- "The Vision of BigBench 2.0": Tilmann Rabl (University of Toronto), Michael Frank (bankmark), Manuel Danisch (bankmark), Hans-Arno Jacobsen (University of Toronto), Bhaskar Gowda (Intel).
- "High-Performance Main-Memory Database Systems and Modern Virtualization: Friends or Foes?": Tobias Mühlbauer (TU Munich), Wolf Rödiger (TU Munich), Andreas Kipf (TU Munich), Alfons Kemper (TU Munich), Thomas Neumann (TU Munich).
- "Speculative Approximations for Terascale Distributed Gradient Descent Optimization": Chengjie Qin (UC Merced), Florin Rusu (UC Merced).
12:00 - 13:30 Lunch Break
13:30 - 14:30 Session 3: Academic Keynote
- Academic Keynote Speech by Chris Re (Stanford University)
Title: "DeepDive and its applications in science and the fight against human trafficking"
Abstract: Many pressing questions in science are macroscopic, as they require that a scientist integrate information from many sources of data, often expressed in natural languages or in graphics; these forms of media are fraught with imprecision and ambiguity and so difficult for machines to understand. This talk describes DeepDive, a new type of statistical extraction and integration system to cope with these problems. For some tasks in paleobiology, DeepDive-based systems have surpassed human volunteers in data quantity, recall, and precision. This talk will describe some applications of DeepDive including to genomics, the fight against human trafficking, and enterprise applications. This talk will describe DeepDive's technical core of classical data management techniques as well as its new techniques for efficient statistical computation including Hogwild! and its successor the DimmWitted engine.
Bio: Christopher (Chris) Re is an assistant professor in the Department of Computer Science at Stanford University and a Robert N. Noyce Family Faculty Scholar. His work's goal is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. He then spent four wonderful years on the faculty of the University of Wisconsin, Madison, before moving to Stanford in 2013. He helped discover the first join algorithm with worst-case optimal running time, which won the best paper at PODS 2012. He also helped develop a framework for feature engineering that won the best paper at SIGMOD 2014. In addition, work from his group has been incorporated into scientific efforts including the IceCube neutrino detector and PaleoDeepDive, and into Cloudera's Impala and products from Oracle, Pivotal, and Microsoft's Adam. He received an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, and a Moore Data Driven Investigator Award in 2014. Chris was an early member of and continues to be an adviser to Context Relevant.
15:00 - 15:30 Coffee Break
15:30 - 17:00 Session 4: Scalable Machine Learning
- Research Paper: "Caffe con Troll: Shallow Ideas to Speed Up Deep Learning": Stefan Hadjis (Stanford University), Firas Abuzaid (Stanford University), Ce Zhang (Stanford University), Chris Re (Stanford University).
- "SystemML’s Optimizer: Advanced Compilation Techniques for Large-Scale Machine Learning Programs": Matthias Boehm (IBM Almaden).
- "Apache SINGA: A general distributed deep learning platform": Wei Wang (National University of Singapore).
Accepted Papers
- "Speculative Approximations for Terascale Distributed Gradient Descent Optimization": Chengjie Qin (UC Merced), Florin Rusu (UC Merced)
- "Caffe con Troll: Shallow Ideas to Speed Up Deep Learning": Stefan Hadjis (Stanford University), Firas Abuzaid (Stanford University), Ce Zhang (Stanford University), Chris Re (Stanford University)
- "The Vision of BigBench 2.0": Tilmann Rabl (University of Toronto), Michael Frank (bankmark), Manuel Danisch (bankmark), Hans-Arno Jacobsen (University of Toronto), Bhaskar Gowda (Intel)
- "High-Performance Main-Memory Database Systems and Modern Virtualization: Friends or Foes?": Tobias Mühlbauer (TU Munich), Wolf Rödiger (TU Munich), Andreas Kipf (TU Munich), Alfons Kemper (TU Munich), Thomas Neumann (TU Munich)
Workshop Organizers
- Asterios Katsifodimos (Technische Universität (TU) Berlin)
- Magdalena Balazinska (University of Washington)
- Michael J. Carey (University of California, Irvine)
- Volker Markl (Technische Universität (TU) Berlin)
Program Committee
- Stratos Idreos (Harvard University, USA)
- Christopher Re (Stanford University, USA)
- Jorge-Arnulfo Quiané-Ruiz (QCRI, Qatar)
- Sudip Roy (Cornell University, USA)
- Emad Soroush (Dato Inc., USA)
- Konstantinos Karanasos (Microsoft Research, USA)
- Till Westmann (Oracle Labs, USA)
- Spyros Blanas (Ohio State University, USA)
- Frank McSherry
- Nodira Khoussainova (Twitter Inc., USA)
- Markus Weimer (Microsoft Research, USA)
- Minqi Zhou (East China Normal University, China)
- Neoklis Polyzotis (University of California, Santa Cruz & Google Inc., USA)
Call for Papers
Data nowadays comes from various sources, including log files, transactional applications, the Web, social media, scientific experiments, and many others. In recent years, analyses of these data have proven useful in helping companies engage and serve their users and define their corporate strategy, in helping political candidates win elections, and in transforming the process of scientific discovery. However, these successes are just the tip of the iceberg: every day, new, more complex analysis techniques are devised and larger, more varied datasets are accumulated. Tackling the complexity of both the data itself and its analysis remains an open challenge.
Big Data has brought new challenges in data management and analysis, and is currently changing the landscape of database technology. Much of this data is generated and transmitted in real time and at an unprecedented scale. To extract value out of these datasets, business analysts and scientists employ advanced data analytics techniques combining, among others, traditional BI, text analytics, machine learning, data mining, and natural language processing. Dataflow and streaming execution engines, programming models, novel data analysis languages, and cluster resource managers are only a few examples of the efforts to tackle these challenges. However, we are still far from being able to ingest, query, and visualize large amounts of data in a scalable, easy-to-use, and standardized manner.
The target audience of the workshop is database, storage, networking, and distributed systems researchers, domain scientists using Big Data for their research, practitioners building systems for Big Data processing, as well as data analysts who extract value out of Big Data. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of vision papers, outrageous ideas, descriptions of data analysis systems and architectures, and industrial experience reports on data analytics.
Topics of Interest
Areas of particular interest for the workshop include (but are not limited to):
- Scalable machine learning
- Large scale stream processing
- Scientific and industrial experiences and use cases
- Benchmarking, tuning, and testing
- Data science and analytics
- Data management and analytics as a service
- Languages for Big Data analytics
- Parallel execution and optimization
- Scalable storage and indexing
- Workload management
- Infrastructures for cloud computing
- Innovative data analysis systems and architectures
Submissions should take the form either of a full paper of up to 10 pages (including references), or of an extended abstract of no more than 4 pages. The 4-page option is available to enable subsequent publication in journals or other conferences that would not consider results published in preliminary form. All accepted papers will be published in DanaC's proceedings at ACM.
Papers must follow the ACM Proceedings Format, using one of the templates provided by ACM.
The workshop solicits:
- Research Papers
- Vision Papers
- Innovative Systems and Architectures
- Demonstration Papers
- Use Cases
- Controversial Topics
- Industrial Experiences
Important Dates
- Submission deadline: April 1, 2015, 23:59 (PST)
- Notification of acceptance: April 20, 2015
- Final papers due: April 30, 2015
- Workshop: May 31, 2015