MLconf Atlanta

The first Atlanta-based MLconf was held on Friday, September 19th at the Academy of Medicine. The event focused on ML platforms, tools, and algorithms. We hosted a speaker from Facebook who gave us an overview of how machine learning shapes the biggest social network. We were very happy to have speakers from emerging platforms like Skytree, with their fast and scalable machine learning server, and 0xdata, who presented their open source platform and their new deep learning toolbox. The event also hosted a talk on SYSTAP’s GPU-based graph analytics and machine learning platform, and Cloudera presented on their current trends. Professor Manos Antonakakis drew on his experience in internet security to explain why machine learning alone cannot solve problems without some domain expertise. Professor Amy Langville, the guru of ranking and co-author of “Who’s #1?” and “Google’s PageRank and Beyond”, presented on how to rank all kinds of data, from sports teams to movies. Netflix and Meetup presented on how they do their recommendations and on the differences in the constraints and data availability they face.

Friday at 8:00 AM

Breakfast and Registration


Friday at 9:00 AM

Ewa Dominowska – Engineering Manager, Facebook

Bio:
Ewa Dominowska joined Facebook in spring of 2014 as an Engineering Manager focused on Science and Metrics for Online Advertising. Before coming to Facebook she designed a large scale predictive analytics platform for mobile devices as a Chief Architect at Medio Systems (acquired by Nokia). Prior to her start-up days, Ewa spent 10 years in various roles at Microsoft. At Microsoft, Ewa joined the Online Services Division to help found adCenter, the second largest online advertising platform in the US. Her work focused on real-time ad ranking, targeting, content analysis, click prediction, and pricing models. As part of the small yet dynamic original team, Ewa designed, architected, and built the alpha version of the contextual advertising product. In 2007, Ewa founded the Open Platform Research and Development team. As part of this effort, she organized the Beyond Search academic program, TROA WWW Workshop, and IRA SIGIR Workshop, resulting in a number of very successful collaborations between academia and industry. During her tenure in the Online Services Division, Ewa spent a year serving as the TA for Satya Nadella, where she advised and assisted in operation and planning for the division. The role encompassed architecture, technology, large-scale data services, and cross-organizational efficiency. Ewa was responsible for the intellectual property process, long-term strategy, and prioritization for the division. In 2010 Ewa started the adCenter Marketplace team responsible for all aspects of the advertising marketplace health and tuning. She architected and built a petabyte-scale distributed data and analytics platform and created a suite of marketplace and experimentation tools. Ewa earned her degrees in Electrical Engineering/Computer Science and Mathematics from MIT. Her research focused on machine learning, natural language processing, and predictive, context aware systems applied in the medical field. 
Ewa authored several papers and dozens of patents in the areas of online advertising, search, pricing models, predictive algorithms and user interaction.


Friday at 9:45 AM

Evan Estola – Data Scientist, Meetup.com

Abstract: Beyond Collaborative Filtering: using Machine Learning to power recommendations at Meetup
Collaborative filtering and other common recommendation algorithms are powerful techniques for some scenarios. I will cover how to design a recommendation system from the ground up using an ensemble classifier and supervised learning to avoid some of the pitfalls of collaborative filtering. From sampling to deployment, we’ve had to invent our approach with few non-academic and non-toy examples to follow. At Meetup we’re all about sharing information and empowering communities, so I’ll present the details of our model as well as some of the new features we are still developing.

Bio:
Evan is a Machine Learning Engineer at Meetup, where he is responsible for building intelligent systems that directly affect the user experience. Evan owns the recommendation engine at Meetup from data collection to production. Previously, Evan was on the Machine Learning Team at Orbitz Worldwide and he got his start in the Information Retrieval Lab at the Illinois Institute of Technology.


Friday at 10:10 AM

Amy Langville – Associate Professor of Mathematics, The College of Charleston in South Carolina

Abstract:
My talk will cover four ranking and clustering projects that I consulted on this past year. The projects range from ranking Olympic athletes, mixed martial arts fighters, and cell phone carriers to clustering sentences to rank individuals by how much humility they evidence in their written language. For each project, I will address the particular data challenges and the solutions and techniques we proposed.

Bio:
Amy is an Associate Professor of Mathematics at The College of Charleston in South Carolina where she regularly teaches graduate courses in Operations Research and Optimization and undergraduate courses in calculus and linear algebra. Her research focuses on ranking and clustering. She also enjoys solving applied mathematics problems from industry and has consulted with a variety of companies from large search engines and software companies to small start-ups and law firms engaged in patent infringement cases. Amy studied Operations Research for her PhD and web information retrieval for her postdoctorate at N.C. State University. When the surf’s up, Amy’s riding it. When it’s not, she’s training jiu-jitsu, peppering a volleyball, or biking around Folly Beach.


Coffee breaks provided by Insightpool
10:35 AM – 10:50 AM


Friday at 10:50 AM

Elizabeth Elhassani – Director of Marketing Analytics and Insights, LexisNexis

Bio:
Elizabeth Elhassani joined LexisNexis Risk Solutions as Director, Marketing Analytics & Insights in January 2012. In this newly created position within Marketing, Elizabeth is responsible for leading the design and implementation of short- and long-term analytic strategies to benefit all of our businesses. This includes targeting and segmenting our client and prospect databases for effective demand generation, as well as working closely with our Marketing and Sales colleagues to track, analyze, and report results of all customer-facing initiatives, both online and offline. An experienced marketing professional, Elizabeth brings more than 10 years of B2B and B2C analytics marketing experience to our ranks, with emphasis on designing statistical models, CRM strategies, segmentation schemes, and cost-benefit analyses. She was Associate Director for dunnhumby USA, where she was responsible for scoping, pricing, and designing consumer analytic insight projects for 10+ key consumer packaged goods clients, utilizing many statistical methodologies to study customer behaviors, including linear and nonlinear regression, CART/CHAID, ANOVA, and cluster analysis. Prior to her work at dunnhumby, she was a Statistical Project Director for ChoicePoint Precision Marketing, where she was responsible for consulting and directing projects for marketing analytics and acquisition models for external clients. In addition to her analytics expertise, she also brings an understanding of our industry with previous experience at Experian and Advanta Bank Business Cards.


Friday at 11:15 AM

Parikshit Ram – Senior Machine Learning Scientist, Skytree

Abstract: Max-kernel search: How to search for just about anything?

Nearest neighbor search is a well studied and widely used task in
computer science and is quite pervasive in everyday applications.
While search is not synonymous with learning, search is a crucial tool
for the most nonparametric form of learning. Nearest neighbor search
can directly be used for all kinds of learning tasks — classification,
regression, density estimation, outlier detection. Search is also the
computational bottleneck in various other learning tasks such as
clustering and dimensionality reduction. Key to nearest neighbor
search is the notion of “near”-ness or similarity. Mercer kernels form
a class of general nonlinear similarity functions and are widely used
in machine learning. They can define a notion of similarity between
pairs of objects of any arbitrary type and have been successfully
applied to a wide variety of object types — fixed-length data, images,
text, time series, graphs. I will present a technique to do nearest
neighbor search with this class of similarity functions provably
efficiently, hence facilitating faster learning for larger data.
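
For readers unfamiliar with the problem in the abstract above, here is a minimal brute-force sketch of max-kernel search in Python: given a query, return the reference object with the largest kernel value. The Gaussian (RBF) kernel, the `bandwidth` parameter, and the function names are illustrative assumptions, not the speaker's method. The talk is about provably efficient alternatives to this kind of linear scan.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel, a standard Mercer kernel on fixed-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def max_kernel_search(query, references, kernel=rbf_kernel):
    """Return (index, value) of the reference that maximizes kernel(query, ref).

    Brute force: one kernel evaluation per reference object.
    """
    if not references:
        raise ValueError("references must be non-empty")
    best_index, best_value = 0, kernel(query, references[0])
    for i in range(1, len(references)):
        value = kernel(query, references[i])
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value

# Hypothetical toy data: three 2-D reference points and one query.
points = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
index, value = max_kernel_search((0.9, 1.2), points)
```

With the RBF kernel the best match coincides with the Euclidean nearest neighbor, because the kernel decreases with distance. With other kernels, such as a plain dot product, the two can differ, which is why max-kernel search is a distinct problem.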

Bio:
Parikshit Ram is a member of the technical staff at the machine learning startup Skytree (www.skytree.net) where he develops enterprise grade machine learning algorithms. Prior to this, Pari completed his doctorate in Computer Science at Georgia Tech in the School of Computational Science and Engineering where he was a member of the FASTlab and focused on developing fundamental algorithms and statistical tools for machine learning and data mining. Pari joined Georgia Tech in 2007 after completing his BS and MS in Mathematics and Computing in the department of Mathematics at Indian Institute of Technology, Kharagpur, India. Pari has also contributed to the open source machine learning library MLPACK (mlpack.org).


Friday at 11:40 AM

Sri Ambati – CEO & Founder, 0xdata

Bio:
Sri is co-founder and CEO of 0xdata (@hexadata), the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Before 0xdata, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Prior to that, Sri co-founded Platfora and was the Director of Engineering at DataStax. Before that, Sri was Partner & Performance Engineer at the Java multi-core startup Azul Systems, tinkering with the entire ecosystem of enterprise apps at scale.
Before that, Sri was on sabbatical pursuing Theoretical Neuroscience at Berkeley. Prior to that, Sri worked on a NoSQL trie-based index for semi-structured data at the in-memory index startup RightOrder. Sri is known for his knack for envisioning killer apps in fast-evolving spaces and assembling stellar teams towards productizing that vision. Sri is a regular speaker on the Big Data, NoSQL, and Java circuit.


Friday at 12:05 PM

Sandy Ryza – Software Engineer, Cloudera

Abstract: Unsupervised Learning on Huge Data with Apache Spark
Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Spark’s MLlib module contains implementations of several unsupervised learning algorithms that scale to large datasets. In this talk, we’ll discuss how to use and implement large-scale machine learning algorithms with the Spark programming model, diving into MLlib’s K-means clustering and Principal Component Analysis (PCA).
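
As background for the K-means part of the abstract, here is a minimal single-machine sketch of Lloyd's algorithm in plain Python. It is not MLlib's API or implementation, and the function names, the `iterations` and `seed` parameters, and the toy data are all illustrative assumptions. It only shows the two alternating steps (assign points to the nearest center, then recompute centers) that a distributed version parallelizes.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centers, assignments)."""
    rng = random.Random(seed)
    # Initialize centers with k distinct input points.
    centers = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = kmeans(points, k=2)
```

In a data-parallel setting, the assignment step is naturally a map over points and the update step an aggregation per cluster, which is the general shape such algorithms take on a framework like Spark.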

Bio:
Sandy Ryza is an engineer on the data science team at Cloudera.  He is a committer on Apache Hadoop and recently led Cloudera’s Apache Spark development.


LUNCH 12:30 PM – 1:00 PM


Friday at 1:00 PM

Networking


Friday at 1:40 PM

Bryan Thompson – Chief Scientist & Founder, SYSTAP, LLC

Abstract:
I will discuss current research on the MapGraph platform. MapGraph is a new and disruptive technology for ultra-fast processing of large graphs on commodity many-core hardware. On a single GPU you can analyze the Bitcoin transaction graph in 0.35 seconds. With MapGraph on 64 NVIDIA K20 GPUs, you can traverse a scale-free graph of 4.3 billion directed edges in 0.13 seconds for a throughput of 32 Billion Traversed Edges Per Second (32 GTEPS). I will explain why GPUs are an interesting option for data-intensive applications, how we map graphs onto many-core processors, and what the future looks like for the MapGraph platform.

MapGraph provides a familiar vertex-centric abstraction, but its GPU acceleration is hundreds of times faster than main-memory CPU-only technologies and up to 100,000 times faster than graph technologies based on MapReduce or key-value stores such as HBase, Titan, and Accumulo. Learn more at http://MapGraph.io.


Friday at 2:00 PM

Justin Basilico – Research/Engineering Manager, Netflix

Abstract: Learning to Personalize
Netflix instant video streaming represents an estimated one third of peak broadband traffic in the US. Personalization is at the core of our product with recommendations driving about 75% of all viewing. Building a high-quality recommendation system for millions of users requires a careful balancing act of handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. In this talk, I will discuss how we use machine learning to drive our recommendation approach. I will describe some of the data, algorithms, metrics, and experimental methodology we use to effectively apply machine learning at scale. I will also highlight the evolution of our personalization approach from rating prediction to ranking to page generation.

Bio:
Justin Basilico is a Research/Engineering manager for Page Algorithms Engineering at Netflix. He leads an applied research team focused on developing the next generation of algorithms used to generate the Netflix homepage through machine learning, ranking, recommendation, and large-scale software engineering. Prior to Netflix, he worked on machine learning in the Cognitive Systems group at Sandia National Laboratories. He is also the co-creator of the Cognitive Foundry, an open-source software library for building machine learning algorithms and applications.


Friday at 2:25 PM

Tao Ye – Senior Scientist, Pandora

Abstract:
Pandora is best known for the Music Genome Project, the most unique and richly labelled data set of 1.5 million+ songs. Naturally, a content-based approach to music recommendation is the foundation of our online radio service. Over the years we have improved and transformed the recommendation platform to incorporate multi-faceted data and models on this foundation. Combined with a dynamic ensemble system, this platform now powers the most popular streaming music service in the U.S., with 77 million+ monthly active users. In this talk I will discuss the music recommendation topics we work on at Pandora, such as our ever-evolving machine learning tasks, give examples of user modeling tasks, and share challenges we still face.

Bio:
Tao Ye has been a Sr. Scientist on the Pandora playlist team since 2010, working on research-driven system building for recommendation systems, measurements, and user modeling. She has 15 years of experience in the software industry, holding research scientist and lead engineer positions in social media, networking, and mobile systems. She holds 11 granted patents and has published 12 peer-reviewed papers. She received a Master’s degree from UC Berkeley in Computer Science and dual Bachelor’s degrees from State University of New York at Stony Brook in Computer Science and Engineering Chemistry.


Friday at 2:50 PM

Xia Zhu – Intel

Abstract: Streaming and Online Algorithms for GraphX

GraphX is a resilient distributed graph processing framework on Apache Spark. It is designed for, and is good at, analysis of static graphs. However, it does not yet support analysis of time-evolving graphs. In this talk, I will present graph processing research on streaming enhancements for GraphX, which may be used in either pure stream processing or lambda architectures. I will describe an architecture design and demonstrate how it works with three machine learning algorithms, with detailed evaluation and analysis of performance and scalability.

Bio:
As a research scientist at Intel Corporation, Xia (Ivy) Zhu works on graph analytics to provide users with end-to-end solutions that include, but are not limited to, graph ETL, graph building, and machine learning. Prior to joining Intel Labs in 2005, Ivy worked as a senior scientist at Philips Research East Asia. She holds a Doctorate in Computer Science and 13 patents.


Friday at 3:15 PM

Jacob Mundt – Chief Technology Officer, eBrevia

Bio:
Jacob Mundt is the CTO at legal tech startup eBrevia, applying information extraction and summarization to the text of legal documents and contracts. eBrevia provides software tools that help attorneys to speed their review of legal documents while increasing accuracy. Previously Jacob researched summarization, machine translation, and information extraction under Kathleen McKeown at Columbia University, and led the Research and Development team at Outcome Sciences (acquired by Quintiles) to improve patient health outcomes through collection of clinical data from hundreds of hospitals. He holds a Bachelor of Science from Rice University and a Master of Science from Columbia.


Coffee breaks provided by Insightpool
(Book Giveaways) 3:40 PM – 4:10 PM


Friday at 4:10 PM

Manos Antonakakis – Assistant Professor of Computer Systems and Software, Georgia Tech

Abstract: So, you think you can model Internet abuse with machine learning?

Abuse on the Internet is an everyday problem. Illicit actors
victimize people, which results in a variety of significant problems,
from losing your private information to having your resources
used in other criminal activities. The common denominator
behind Internet abuse is a network of infected machines (a.k.a.
botnet) under the control of a criminal entity (a.k.a. botmaster).
Needless to say, the detection of such “botnet communications” is at
the heart of the security problem that a large organization faces every
day. Detection methods based on static methods are doomed to fail,
simply because they will always be behind the threat. Thus, the
community is in great need of scalable abuse detection solutions.

Unsurprisingly, such newly proposed solutions are often based on
machine learning. In this talk I will argue that a fancy
machine-learning algorithm (and derived pretty graph pictures)
will simply not “cut it” operationally. This is especially true in
the case where what you are trying to solve is not your company’s
marketing problem, but rather the security problem your network and
security operation center is facing every day. Domain
knowledge and constant counterintelligence on the malicious actors are
fundamental to properly crafting generic detection and attribution
solutions that can keep up with constantly changing malicious
methodologies while at the same time minimizing false and
missed detections.

Bio:
Manos Antonakakis received his engineering diploma in 2004 from the University of the Aegean, Department of Information and Communication Systems Engineering. From November 2004 up to July 2006, he was working as a guest researcher at the National Institute of Standards and Technology (NIST-DoC), in the area of wireless ad hoc network security, at the Computer Security Division. Before joining the ECE faculty, Dr. Antonakakis held the chief scientist role at Damballa, where he was responsible for advanced research projects, university collaborations, and technology transfer efforts. He currently serves as the co-chair of the Academic Committee for the Messaging Anti-Abuse Working Group (MAAWG). In May 2012, he received his Ph.D. in computer science from the Georgia Institute of Technology under Wenke Lee’s supervision. In his free time, he enjoys watching and playing soccer.


Friday at 4:35 PM

Danai Koutra – CMU/Technicolor Researcher, Carnegie Mellon University

Abstract:
Networks naturally capture a host of interactions in the real world spanning from friendships to brain activity. But, given a massive graph, like the Facebook social graph, what can be said about its structure? Which are its most important structures? How does it compare to other networks like Twitter? This talk will focus on my work developing scalable algorithms and models that help us to make sense of large graphs via pattern discovery and similarity analysis.

I will begin by presenting VoG, an approach that efficiently summarizes large graphs by finding their most interesting and semantically meaningful structures. Starting from a clutter of millions of nodes and edges, such as the Enron who-mails-whom graph, our Minimum Description Length based algorithm disentangles the complex graph connectivity and spotlights the structures that ‘best’ describe the graph.

Then, for similarity analysis at the graph level, I will introduce the problems of graph comparison and graph alignment. I will conclude by showing how to apply my methods to temporal anomaly detection, brain graph clustering, deanonymization of bipartite (e.g., user-group membership) and unipartite graphs, and more.

Bio:
Danai Koutra is a final-year Ph.D. candidate at the Computer Science Department at Carnegie Mellon University. Her research interests include large-scale graph mining, graph similarity and matching, graph summarization, and anomaly detection. Danai’s research has been applied mainly to social, collaboration, and web networks, as well as brain connectivity graphs. She holds 1 “rate-1” patent and has 6 (pending) patents on bipartite graph alignment. Danai has multiple papers in top data mining conferences, including 2 award-winning papers, and her work has been covered by the popular press, such as MIT Technology Review. She has also worked at IBM Hawthorne, Microsoft Research Redmond, and Technicolor Palo Alto/Los Altos. She earned her M.S. in Computer Science from CMU in 2013 and her diploma in ECE at the National Technical University of Athens in 2010.


Friday at 5:10 PM

Hassan Chafi – Research Manager, Oracle Labs

Abstract: PGX: An In-Memory, Parallel Graph Analytic and Query Engine

Brief Description:
An in-memory (and distributed) graph analytic engine that is tightly coupled with a relational database.

Long Description/Abstract:
We present a graph processing system in which a graph database is tightly integrated with a graph analytic engine. Our graph database, based on existing NoSQL and relational databases, provides scalable management of graph data for transactional workloads. Our graph analytic engine, on the other hand, enables rapid execution of analytic workloads. We first introduce PGX, our in-memory graph analytic engine, which initially loads the graph data from the database and periodically synchronizes afterward. The parallel execution engine of PGX is very efficient, e.g. counting triangles in billion-edge graphs in 2 minutes. Users can also submit their custom graph algorithms written in a domain-specific language (DSL); PGX automatically parallelizes them for execution. Then we introduce PGX.DIST, our distributed graph analytic engine. We show that PGX.DIST is up to orders of magnitude faster than the state-of-the-art graph analytic engine. The DSL compiler lets the same algorithm run on both PGX and PGX.DIST transparently.
* Graph database tightly integrated with graph analytic engine
* Fast, parallel in-memory graph analytic engine
* Distributed graph analytic engine
* Use of Domain-Specific Language for graph analytics.


Friday at 5:35 PM

Dan Mallinger – Data Science Practice Manager, Think Big Analytics

Abstract: Organizing for Data Science
This talk will introduce a paradigm for enabling access to large, unstructured, and novel datasets in enterprises, while retaining value from existing tools and staff. By following a real-world example, the discussion will walk through how small, central data science teams can make data discoveries and data value accessible to others. We will also review the tools, data science approaches, and best practices for uncovering, polishing, and digesting signal in data to support analytics at the front lines of business.

Bio:
Dan Mallinger is the Data Science Practice Manager for Think Big Analytics. He has deep experience enabling analytics at enterprises and implementing data science solutions, having helped many of the Fortune 100. Dan has extensive experience working with product, business, and marketing teams across a wide variety of industries. His work with them has been focused on driving value from multi-structured and unstructured data sets. He is formally trained in statistics, computer science, and organizational psychology & leadership.


Friday at 5:55 PM

Thank Yous, Book Giveaways, and Wrap-Up

