The GraphLab conference attracts the most interesting emerging data science projects. Join us on Monday, July 21, 2014, at the Nikko Hotel in SF.
We will have oral talks from GraphLab, Spark, DataPad (a startup from the creator of Python pandas), Trifacta (a startup from the creator of d3.js), Cloudera, Microsoft, Google, Pivotal, Adobe, Lab41, CMU and Pandora.
We can roughly divide the presenters into several domains: graph analytics (GraphLab, Pregel, Petuum, Grappa, Stinger, Grafos.ml, Parameter Server, etc.), graph databases, graph visualization, Python data science tools, and applications on graphs.
Graph databases are an emerging field. They store and query graphs and are optimized for high performance on graph-structured data. We will have demos from the leading graph database companies: Neo Technology (Neo4j), Aurelius (Titan), Franz, Objectivity (InfiniteGraph), and Sparsity Technologies.
Visualization helps data scientists dive deep into their data. We will have presentations from Trifacta, Cambridge Intelligence, Graphistry (visualization using GPUs), Linkurious (a startup from the creators of the open source Gephi), Ayasdi, Tom Sawyer Software, and Plot.ly.
We also have a strong presence of academic projects. Some examples are Petuum (CMU), a new system by Prof. Eric Xing; Parameter Server (CMU), a mega-scale framework for cluster implementation of ML methods by Prof. Alex Smola; Grappa (UW), a super fast graph analytics framework by Mark Oskin; and Stinger, a streaming graph system from Georgia Tech.
Abstract: GraphLab Strategy, Vision and Practice
Carlos is the CEO and co-founder of GraphLab and the Amazon Professor of Machine Learning in Computer Science & Engineering at the University of Washington. A world-recognized leader in the field of Machine Learning, Carlos was named one of the 2008 “Brilliant 10” by Popular Science Magazine, received the 2009 IJCAI Computers and Thought Award for his contributions to Artificial Intelligence, and a Presidential Early Career Award for Scientists and Engineers (PECASE).
Baldo Faieta, Social Computing Lead, Adobe Systems
Abstract: Algorithms for Creatives Talent Search using GraphLab
In this talk, we present the work we’ve been doing at Adobe to build a custom recommendation solution for Behance, our social network for creatives, to surface our creative users and their work. We’ll introduce custom ranking and similarity algorithms that analyze the social graph data generated by the activity of our users and match the specific requirements of talent search. We show how we leverage GraphLab both to implement our custom algorithms and to use those provided by the platform, and we discuss the end-to-end workflows in the context of the GraphLab Create framework.
Baldo Faieta is a senior computer scientist at Adobe working as the lead in the social computing group of Adobe’s Behance, a social network for creatives. He joined Adobe after serving as founder and CTO of several startups in Silicon Valley and in Europe. Baldo received his B.Sc. in Computer Science from Carnegie Mellon University. He was also a member of the research staff at the Xerox Palo Alto Research Center and the Interval Research Corp.
Abstract: Dendrite: A web based platform for large scale graph analytics
Lab41 developed Dendrite, a proof-of-concept, open source web application that allows less technical subject matter experts to collaboratively track and analyze graph data through a web-based user interface that tracks changes. Dendrite enables this form of iterative analysis by integrating Titan and GraphLab into a coherent system. The resulting versioning and information sharing accelerate analysis and improve result quality.
Lab41 (www.lab41.org) is a Challenge Lab that brings together stakeholders from industry, academia, government, and IQT to collaborate on hard problems in Big Data analytics. Lab41 is located in Menlo Park, where it shares office space with its parent In-Q-Tel (www.iqt.org).
Erick Tryzelaar is a Senior Software Engineer at In-Q-Tel/Lab41, where he led the development of Dendrite. Prior to Lab41, he was a software engineer at Zynga, where he owned the server configuration management and provisioning system. He also worked at Pixar on their server render farm software. In his spare time, he is one of the core contributors of Mozilla’s Rust systems programming language.
Tao Ye, Senior Scientist, Pandora Internet Radio
Abstract: Large scale music recommendation @ Pandora
Pandora is best known for the Music Genome Project, a unique and richly labelled dataset of more than 1.5 million songs. Naturally, a content-based approach to music recommendation is the foundation of our online radio service. Over the years we have improved and transformed the recommendation platform to incorporate multi-faceted data and models on this foundation. Combined with a dynamic ensemble system, this platform now powers the most popular streaming music service in the U.S., with 77 million+ monthly active users. In this talk I will discuss the music recommendation topics we work on at Pandora, such as our ever-evolving machine learning tasks, give examples of user modeling tasks, and share challenges we still face.
Prof. Alex Smola, CMU and Google
Abstract: Scaling Distributed Machine Learning with the Parameter Server
We propose a Parameter Server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communications between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. To demonstrate the scalability of the proposed framework, we show experimental results on petabytes of real data with billions of examples and parameters on problems ranging from Sparse Logistic Regression to Latent Dirichlet Allocation and Distributed Sketching.
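The division of labor described in this abstract can be illustrated with a toy, single-process sketch. This is not the Parameter Server API: the class `ParameterServer`, the function `worker_step`, the learning rate, and the data shards are all hypothetical, and a simple squared-error linear model stands in for the problems named in the abstract. It only shows workers pulling the sparse parameters they need and pushing sparse gradients back without a global barrier.

```python
from collections import defaultdict


class ParameterServer:
    """Toy in-process stand-in for the server nodes: holds a sparse weight vector."""

    def __init__(self, learning_rate=0.5):
        self.weights = defaultdict(float)  # sparse vector: feature index -> weight
        self.learning_rate = learning_rate

    def pull(self, keys):
        # Workers fetch only the parameters their data touches.
        return {k: self.weights[k] for k in keys}

    def push(self, gradient):
        # Apply a sparse gradient as soon as it arrives (no global barrier).
        for k, g in gradient.items():
            self.weights[k] -= self.learning_rate * g


def worker_step(server, shard):
    """One worker pass over its data shard for a squared-error linear model."""
    for features, target in shard:
        w = server.pull(features.keys())
        prediction = sum(w[k] * v for k, v in features.items())
        error = prediction - target
        server.push({k: error * v for k, v in features.items()})


# Two hypothetical data shards with sparse features {index: value}.
shards = [
    [({0: 1.0}, 2.0), ({1: 1.0}, -1.0)],
    [({0: 1.0}, 2.0), ({2: 1.0}, 0.5)],
]

server = ParameterServer()
for _ in range(50):
    for shard in shards:  # in a real system the shards would run concurrently
        worker_step(server, shard)
```

After the loop, `server.weights` holds values close to the targets in the shards (about 2.0, -1.0 and 0.5 for features 0, 1 and 2). A real deployment would add networking, consistency models and fault tolerance, none of which this sketch attempts.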
Prof. Joe Hellerstein, Founder & CEO of Trifacta
Abstract: Data, DSLs and Transformation: Research and Practice
Joseph M. Hellerstein is co-founder and CEO of Trifacta, and Chancellor’s Professor of Computer Science at the University of California, Berkeley. His work focuses on data-centric systems and the way they drive computing. He is an ACM Fellow, an Alfred P. Sloan Research Fellow and the recipient of three ACM-SIGMOD “Test of Time” awards for his research. In 2010, Fortune Magazine included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his Bloom language for cloud computing on their TR10 list of the 10 technologies “most likely to change our world”.
Reynold Xin, Co-Founder, Databricks
Abstract: Unified Data Pipeline in Apache Spark
One of the promises of Apache Spark is to let users build unified data analytic pipelines that combine diverse processing types. In this talk, we’ll demo this live by building a machine learning pipeline with 3 stages: ingesting JSON data; training a k-means clustering model; and applying the model to a live stream of tweets. Typically this pipeline might require a separate processing framework for each stage, but we can leverage the versatility of the Spark runtime to combine SQL, MLlib, and Spark Streaming and do all of the data processing in a single, short program. This allows us to reuse code and memory between the components, improving both development time and runtime efficiency. Spark as a platform integrates seamlessly with Hadoop components, running natively in YARN and supporting arbitrary Hadoop InputFormats, so it brings the power to build these types of unified pipelines to any existing Hadoop user. This talk will be a fully live demo and code walkthrough where we’ll build up the application throughout the session, explain the libraries used at each step, and finally classify raw tweets in real-time.
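The three stages named in this abstract (ingest JSON, train k-means, apply the model to a stream) can be sketched in a few lines of plain Python. This is not Spark code and uses none of Spark SQL, MLlib or Spark Streaming; the sample tweets, the `featurize` function and the choice of initial centers are made up for illustration, and a generator stands in for the live stream.

```python
import json


def featurize(text):
    """Map a tweet to a tiny feature vector: (word count, hashtag count)."""
    words = text.split()
    return (float(len(words)), float(sum(w.startswith("#") for w in words)))


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm starting from the given initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster; keep it if empty.
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else old
            for c, old in zip(clusters, centers)
        ]
    return centers


# Stage 1: ingest JSON (made-up records standing in for stored tweets).
lines = [
    '{"text": "good morning"}',
    '{"text": "hello world"}',
    '{"text": "#spark #mllib #streaming demo today now"}',
    '{"text": "#graphlab #conference talk in sf soon"}',
]
train = [featurize(json.loads(line)["text"]) for line in lines]

# Stage 2: train a k-means model with k=2 (initial centers chosen by hand).
model = kmeans(train, centers=[train[0], train[-1]])


# Stage 3: apply the model to a "stream" of new tweets.
def stream():
    yield "#bigdata #ml is fun to learn"
    yield "see you"


labels = [nearest(featurize(t), model) for t in stream()]
```

Here the same `featurize` and `nearest` functions are shared between training and scoring, which is the kind of code reuse across stages that the abstract attributes to a unified pipeline.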
Reynold Xin is an Apache Spark committer and a co-founder of Databricks. He is currently on leave from his PhD studies at UC Berkeley AMPLab, where he focused on scalable data processing.
Wes McKinney, Founder & CEO, DataPad
Abstract: Fast Medium Data Analytics at Scale
Abstract: The Zoo Expands: Labrador *Loves* the Elephant; GraphLab on Hadoop, Thanks to Hamster
The refactoring of the Hadoop MapReduce framework, which separates resource management (YARN) from job execution (MapReduce), has allowed multiple programming paradigms to take advantage of massive-scale Hadoop Distributed File System (HDFS) clusters. Hamster (Hadoop And Mpi on the same cluSTER) is a port of OpenMPI to use YARN as a resource manager. Hamster allows applications written using MPI (Message Passing Interface) to run alongside other YARN applications and frameworks, such as MapReduce, on the same Hadoop cluster. In this talk, I will describe the architecture of Hamster, and present a few MPI applications that have been demonstrated to run in Hadoop. GraphLab uses MPI as one of the supported communication libraries, and can read/write data from/to HDFS. I will describe how GraphLab runs on top of Hadoop using Hamster, and present a few benchmarks in graph analytics, comparing GraphLab with other machine frameworks.
Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and were his area of specialization for his PhD (Computer Science) from the University of Illinois at Urbana-Champaign. He worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo! and LinkedIn. Currently, he is the Chief Scientist at Pivotal (formerly Greenplum, a division of EMC).
Abstract: ASYMP: Fault-tolerant Graph Mining via ASYnchronous Message Passing
In this talk, we introduce ASYMP, a new distributed computation framework, developed at Google, which is most suitable for implementing large-scale graph mining algorithms. Unlike several previous distributed frameworks such as MapReduce and Pregel, which support synchronized computation, this framework implements an “asynchronous” message passing computation model. On the other hand, similar to the MapReduce framework, ASYMP supports a high level of fault tolerance for machine failures or pre-emption. To support this fault tolerance, algorithms developed in this framework have to be self-stabilizing. In this talk, after giving a general description of the new framework and its support for fault tolerance, we report its performance in comparison to the MapReduce and Pregel frameworks for some basic graph mining tasks such as computing shortest paths and connected components. We conclude with various open research directions related to this framework. (Joint work with Eduardo Fleury and Silvio Lattanzi.)
Vahab Mirrokni is a senior staff research scientist at Google Research NYC, where he is heading the algorithms research group. He joined Google after several research appointments at Microsoft Research, MIT and Amazon. He received his PhD from MIT and his B.Sc. from Sharif University. Vahab’s research interests include large-scale graph mining, approximation algorithms, and algorithmic game theory. At Google, he is mainly working on algorithmic and economic problems related to internet search and online advertising. As two of his recent projects, he is serving as a tech lead manager for the large-scale graph mining and the market algorithms teams based in Google Research NYC.
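To make the idea of a self-stabilizing, asynchronous algorithm concrete, here is a small single-process simulation of connected components by minimum-label propagation. It is not ASYMP and does not reflect its API; the function `async_components`, the message format, the random delivery order, and the simulated failure are all assumptions made for illustration. Messages are delivered one at a time in arbitrary order (no synchronized supersteps), and one vertex loses its state midway and recovers through ordinary message exchange.

```python
import random


def async_components(edges, num_vertices, seed=0, fail_vertex=None, fail_after=5):
    """Label each vertex with the smallest vertex id in its component."""
    rng = random.Random(seed)
    neighbors = {v: set() for v in range(num_vertices)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    label = {v: v for v in neighbors}

    def announce(v):
        # Messages are (sender, target, candidate label).
        return [(v, n, label[v]) for n in neighbors[v]]

    pending = [m for v in neighbors for m in announce(v)]
    processed = 0
    while pending:
        # Deliver an arbitrary pending message: there are no global supersteps.
        sender, target, candidate = pending.pop(rng.randrange(len(pending)))
        processed += 1
        if candidate < label[target]:
            label[target] = candidate
            pending.extend(announce(target))
        elif candidate > label[target]:
            # The sender is behind; reply so it can repair its own state.
            pending.append((target, sender, label[target]))
        if processed == fail_after and fail_vertex is not None:
            # Simulated machine failure: the vertex forgets its label and restarts.
            label[fail_vertex] = fail_vertex
            pending.extend(announce(fail_vertex))
    return label


edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
labels = async_components(edges, num_vertices=7, fail_vertex=3)
```

The reply rule is what makes this sketch self-stabilizing: a restarted vertex re-announces its own id, and any neighbor holding a smaller label answers with it, so the lost state is rebuilt without restarting the whole computation.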
Josh Wills, Director of Data Science, Cloudera
Abstract: What Comes After The Star Schema?
The star schema is one of the most important ideas in the history of analytics and data management. It provided a common language and set of patterns for querying and exploring transaction-oriented datasets and spawned an entire industry of software tools that revolved around it. As we move into a world in which most of the data we analyze will be generated by machines and sensors, I’d like to explore alternative data models that we can build a new industry around, leveraging ideas from array programming, data historians, and graph processing engines.
Josh Wills is Cloudera’s Senior Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide range of industries. He is the founder and VP of the Apache Crunch project for creating optimized MapReduce pipelines in Java and lead developer of Cloudera ML, a set of open-source libraries and command-line tools for building machine learning models on Hadoop. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+.
Dr. Markus Weimer, Microsoft Research REEF.
Abstract: Towards a Big Data std lib
The availability of powerful distributed data platforms and the widespread success of Machine Learning (ML) have led to a virtuous cycle wherein organizations are investing in gathering a wider range of (even bigger!) datasets and addressing an even broader range of tasks. The Hadoop Distributed File System (HDFS) is being provisioned to capture and durably store these datasets. Alongside HDFS, resource managers like Mesos and YARN enable the allocation of compute resources “near the data,” where applications can cache it and support fast iterative computations. Unfortunately, most machine learning systems are not tuned to operate on these new cloud platforms, where the system one executes on can no longer be viewed as static as in traditional cluster computing. Two new runtime challenges arise: 1) scale-up: the need to acquire more resources dedicated to a particular algorithm, and 2) scale-down: the need to react to resource preemption and failure. In this talk, I will present REEF, our abstraction on top of resource managers, and recent work towards truly resource-aware machine learning algorithms and systems developed on it. As I will demonstrate, machine learning is especially well prepared to treat varying resources natively, as a change in resources can often be equated to changes in the available data.
Dr. Markus Weimer is a Principal Scientist with the Cloud and Information Services Lab at Microsoft, Redmond. There, his work focusses on big data systems with a special emphasis on machine learning and graph computation applications. Markus has several years of experience with Big Data machine learning systems and applications. You can follow and reach him @markusweimer.
O’Reilly spreads the knowledge of innovators through its technology books, online services, magazines, research, and tech conferences. Since 1978, O’Reilly has been a chronicler and catalyst of leading-edge development, homing in on the technology trends that really matter and galvanizing their adoption by amplifying “faint signals” from the alpha geeks who are creating the future. An active participant in the technology community, O’Reilly has a long history of advocacy, meme-making, and evangelism.