Press Articles


IT Business Net Logo

SAN FRANCISCO, CA — (Marketwired) — 07/17/14 — Trifacta, a Data Transformation platform provider, today announced that CEO and co-founder Joe Hellerstein will discuss Trifacta’s research on domain-specific languages and how they can facilitate the data transformation process at the GraphLab Conference on Monday, July 21st. The session is titled “Data, DSL and Transformation: Research and Practice.” Other speakers include Carlos Guestrin, founder and CEO of GraphLab; Milind Bhandarkar, chief data scientist of Pivotal; and Josh Wills, director of data science at Cloudera.

“Starting back with Potter’s Wheel, our research has demonstrated that a carefully designed domain-specific language (DSL) for data transformation can significantly improve productivity for experts and accessibility for non-experts,” said Joe Hellerstein. “Recently at Trifacta, we’ve had an opportunity in a commercial setting to implement a DSL with our Wrangle language and observe first hand the advantages in day-to-day use. We’re looking forward to sharing with the community some of our thoughts and findings.”

What:
GraphLab Conference 2014

Where:
Hotel Nikko,
222 Mason Street
San Francisco, CA 94102

When:
Monday, July 21, 2014
8:00am – 7:00pm PT

The GraphLab Conference is a gathering of 900 data scientists, software engineers and big data analytics thinkers from a variety of companies, academic institutions and organizations leading the way in the big data space. The conference started in 2012 as a meeting of experts to discuss graph analytics and large-scale machine learning, and has grown into the largest conference for graph analysis. GraphLab has partnered with geekSessions to produce its third annual conference.

View the full article here!


Ulitzer Logo


View the full article here!

oreilly

The rise of sensors and connected devices will lead to applications that draw from network/graph data management and analytics. As the number of devices surpasses the number of people — Cisco estimates 50 billion connected devices by 2020 — one can imagine applications that depend on data stored in graphs with many more nodes and edges than the ones currently maintained by social media companies. This means that researchers and companies will need to produce real-time tools and techniques that scale to much larger graphs (measured in terms of nodes & edges). I previously listed tools for tapping into graph data, and I continue to track improvements in accessibility, scalability, and performance. For example, at the just-concluded Spark Summit, it was apparent that GraphX remains a high-priority project within the Spark ecosystem. Another reason to be optimistic is that tools for graph data are getting tested in many different settings. It’s true that social media applications remain natural users of graph databases and analytics. But there are a growing number of applications outside the “social” realm. In his recent Strata Santa Clara talk and book, Neo Technology’s founder and CEO Emil Eifrem listed other use cases for graph databases and analytics:

  • Network impact analysis (including root cause analysis in data centers)
  • Route finding (going from point A to point B)
  • Recommendations
  • Logistics
  • Authorization and access control
  • Fraud detection
  • Investment management and finance (including securities and debt)

The widening number of applications means that business users are becoming more comfortable with graph analytics. In some domains network science dashboards are beginning to appear. More recently, analytic tools like GraphLab Create make it easier to unlock and build applications with graph data. Various applications that build upon graph search/traversal are becoming common, and users are beginning to be comfortable with notions like “centrality” and “community structure”. A quick way to immerse yourself in the graph analysis space is to attend the third GraphLab conference in San Francisco — a showcase of the best tools for graph data management, visualization, and analytics, as well as interesting use cases. For instance, MusicGraph will be on hand to give an overview of their massive graph database from the music industry, Ravel Law will demonstrate how they leverage graph tools and analytics to improve search for the legal profession, and Lumiata is assembling a database to help improve medical science using evidence-based tools powered by graph analytics.
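The two notions named here can be made concrete in a few lines of code. The sketch below uses toy data and plain Python (no particular vendor’s toolkit): it computes degree centrality for a small social graph and recovers its two communities by cutting the one bridge edge.

```python
# Degree centrality and community structure on a toy social graph.
from collections import defaultdict

edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"),   # community 1
         ("dan", "eve"), ("dan", "fay"), ("eve", "fay"),   # community 2
         ("cat", "dan")]                                   # bridge edge

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Degree centrality: a node's degree normalized by the maximum possible degree.
n = len(adj)
centrality = {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def components(adjacency):
    """Connected components by depth-first search."""
    seen, comps = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adjacency[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Crude "community" detection: drop edges whose endpoints share no common
# neighbor (the bridges), then take connected components.
bridges = [(u, v) for u, v in edges if not (adj[u] & adj[v])]
pruned = {k: set(v) for k, v in adj.items()}
for u, v in bridges:
    pruned[u].discard(v)
    pruned[v].discard(u)

print(sorted(centrality, key=centrality.get, reverse=True)[:2])  # most central nodes
print([sorted(c) for c in components(pruned)])                   # the two communities
```

Real toolkits use richer measures (betweenness, modularity-based community detection), but the underlying ideas are the same.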

View the full article here!

 

linkurious-239x60-tr

If you are working with graph data, GraphLab should pique your interest. It is building a machine learning analytics engine for graph datasets. Started in 2009 as an open source project by Carlos Guestrin, the software is used daily for millions of recommendations in popular consumer services. To continue and support the development of GraphLab, GraphLab Inc. raised $6.75M in 2013. Today GraphLab already has several implemented libraries of algorithms, including Topic Modeling, Graph Analytics, Clustering, Collaborative Filtering, Graphical Models and Computer Vision. Its main features include:

  • a unified multicore and distributed API: write once, run efficiently in both shared and distributed memory systems;
  • tuned for performance: an optimized C++ execution engine leverages extensive multi-threading and asynchronous IO;
  • scalable: GraphLab intelligently places data and computation using sophisticated new algorithms;
  • HDFS integration;
  • powerful Machine Learning Toolkits.
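As a rough illustration of what a collaborative filtering toolkit computes, the sketch below does generic item-item similarity in plain Python on made-up ratings. It is not GraphLab’s actual API, just the underlying idea: score unseen items by their similarity to items a user already rated.

```python
# Item-item collaborative filtering on a toy user -> {item: rating} table.
from math import sqrt

ratings = {
    "u1": {"a": 5, "b": 4, "c": 1},
    "u2": {"a": 4, "b": 5},
    "u3": {"c": 5, "d": 4},
    "u4": {"c": 4, "d": 5, "a": 1},
}

def item_vector(item):
    """All ratings given to one item, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v, w):
    shared = set(v) & set(w)
    dot = sum(v[u] * w[u] for u in shared)
    norm = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

items = sorted({i for r in ratings.values() for i in r})
sim = {(i, j): cosine(item_vector(i), item_vector(j))
       for i in items for j in items if i != j}

# Recommend for u2: score each unseen item by similarity to items u2 rated.
seen = ratings["u2"]
scores = {i: sum(sim[(i, j)] * seen[j] for j in seen)
          for i in items if i not in seen}
print(max(scores, key=scores.get))
```

Production systems add normalization, implicit feedback, and distributed execution, but the core loop is this similarity-weighted scoring.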

If you are interested in machine learning and working with graph data, like Walmart or Pandora, GraphLab could help you get more out of your data. On July 21st in San Francisco will be the third annual edition of the GraphLab conference. What started as a small gathering a few years ago has grown into one of the biggest events in the graph community. Among this year’s projected 900 in attendance are data scientists, software engineers and big data analytics thinkers from the companies, academic institutions and organizations leading the way in this space. The conference is focused on fostering knowledge and creating connections within the graph community. Among this year’s speakers are Prof. Carlos Guestrin (Founder & CEO, GraphLab), Prof. Joe Hellerstein (Founder & CEO of Trifacta), Josh Wills (Director of Data Science, Cloudera), and Karthik Ramachandran and Erick Tryzelaar from Lab41. Attendees will learn about real use cases from Pandora (“Large scale music recommendation @ Pandora: design tradeoffs”) and Adobe (“Algorithms for Creatives Talent Search using GraphLab”). Some of the leading companies in the field of big data will be in San Francisco to present their own technologies and their views on graph data. With around 50 different companies and institutions this year, it is nice to see that the graph ecosystem is growing fast. There are some familiar names for those who follow graph technologies, like Neo4j (graph database), MongoDB (NoSQL database), Titan (graph database), and Ayasdi (data analysis). There will also be a few graph visualization companies like Cambridge Intelligence, Cytoscape and Tom Sawyer Software. You can also expect to meet a few newcomers like BigML, PredictionIO, sqrrl and Dataiku (a French startup specialized in ML). We are proud to be part of this year’s event. Sébastien Heymann, CEO of Linkurious and co-founder of Gephi, will be in San Francisco to present the next version of Linkurious Enterprise.
If you are in San Francisco around that time, come say hello!

View the full article here!

 

kdnuggets logo

KDnuggets readers can save $30 off the already low registration fee – use the “KDNuggets” promotional code when registering for the conference at gl3.eventbrite.com. GraphLab seeks to break attendance records by bringing together machine learning experts and data scientists for its third annual conference this July. What started as a modest goal of bringing experts in graph analytics and large-scale machine learning together is now in its 3rd year and takes the form of a full-blown conference. We couldn’t be more grateful or prouder to bring you the GraphLab Conference 2014. Among this year’s attendees are data scientists, software engineers and big data analytics thinkers from the companies, academic institutions and organizations leading the way in this space.

Graph Databases: The event will host demos from influential graph databases such as Neo Technology (Neo4j), Aurelius (Titan), Franz, Objectivity (InfiniteGraph), and Sparsity Technologies.

Graph Visualization: Trifacta, Cambridge Intelligence, Graphistry (visualization using GPUs), Linkurious (a startup from the creators of the Gephi open source project), Tom Sawyer Software, and Plot.ly will present their recent work in helping data scientists visualize their data.

Python Tools: The GraphLab Conference will host presenters and demos from Skytree, BigML, Zipfian Academy (Python training), Continuum Analytics, IPython, Domino Data Labs, and Dataiku.

Academic Demos: The conference will also host an interesting academic presence, including Petuum (CMU), a new system by Prof. Eric Xing; Parameter Server (CMU), a mega-scale framework for cluster implementation of ML methods by Prof. Alex Smola; Grappa (UW), a super-fast graph analytics framework by Mark Oskin; and Stinger, a streaming graph system from Georgia Tech.

Graph Use Cases: Senzari, a company based in Florida, is creating the largest music graph – with 100 billion facts related to music! Ravel Law is using graphs obtained from Supreme Court rulings to deduce interesting and useful facts about law. Lumiata is compiling a healthcare graph for evidence-based medical science powered by graph analytics. Crosswise is using graphs for security and entity disambiguation purposes.

View the full article here!

 

iCrunchData logo

The 3rd annual GraphLab Conference, on Monday, July 21st, began 3 years ago with the modest goal of bringing graph analytics and machine learning experts together to discuss the latest news. Fast forward 3 years and it is now the premier graph analytics event, expected to attract over 900 Data Scientists, Software Engineers and Big Data Analytics Thinkers all converging in San Francisco at the Hotel Nikko. The conference will include presentations from the founders of Trifacta, DataPad and Databricks as well as chief scientists from Cloudera, Pivotal, Pandora, Google Research and more. If you are in the graph analytics space and want to hone your data science skills by learning from the best, get to the GraphLab Conference and get ready for:

  • GraphLab Strategy, Vision and Practice
  • The Zoo Expands: Labrador Hearts Elephant Thanks to Hamster
  • What Comes After The Star Schema
  • Fast Medium Data Analytics at Scale
  • Machine Learning and Graph Computation on Spark
  • ASYMP: Fault-tolerant Graph Mining via ASYnchronous Message Passing
  • REEF: Towards the Big Data stdlib
  • Dendrite Large Scale Graph Analytics
  • Algorithms for Creatives Talent Search using GraphLab
  • Scaling Distributed Machine Learning with the Parameter Server

Day 2 of this event is a full day of hands-on training to teach participants how to build a machine learning system at scale from prototype to production using GraphLab Create. Check out the details of this two day event starting with the GraphLab Conference on graph analytics and concluding with a full day of hands-on training.

View the full article here!

gigaom-logo

Webscale companies such as Facebook, Google, and Netflix have come clean about how they use graph processing to quickly reveal the seemingly disparate connections among people, places and things. And more use cases for graph databases emerged Monday at the 2013 GraphLab workshop in San Francisco. But even though it became clearer what’s possible when data is organized in graphs — better e-commerce and Twitter follower recommendations and lighter infrastructure usage, for example — some speakers pointed to the need for graphs and machine learning to become easier to implement.

Graphs at scale at Twitter and Walmart

Twitter’s Who to Follow tool is a fine example of a product benefiting from a graph model for data. Who to Follow depends on the FlockDB graph database and the Cassovary in-memory graph-processing engine Twitter constructed in-house and then released to everyone under an Apache License. The product mines existing connections among users, shared interests and other data in order to make its recommendations with data in a graph that can run inside the memory of a single server. Take it as proof that the graph model can provide advantages over a more traditional relational model for certain kinds of applications. The system’s success over the past three years demonstrates that it’s not only possible but preferable for a graph to run in a single instance of memory, said Pankaj Gupta, head of the personalization and recommender systems group at Twitter. Lei Tang, a data scientist at @WalmartLabs, talked about how he’s been working on drawing on lots of data sources to recommend products to website users that they might actually want to buy. A smart recommendation system ought to shift in response to incoming data on, say, a user’s page views and purchases, he said. This is where clustering of products can be wise. So while a user might view a bunch of televisions before ultimately buying one, the cluster of television products within the larger set of products the system can recommend should be set aside as soon as the purchase happens. Recommend a television with big discounts after the purchase, Tang said, and “users are really pissed off.” Also, in the domain of e-commerce it’s important to add nuance into recommendations. For example, a good recommendation system would suggest to users a primary product such as an iPhone before showing accessories such as a case or earphones. So companies should make those page views and other data count and focus on granular product categories in order to maximize purchases through recommendations. And these sorts of fine-grained tweaks need to be made quickly for millions of users, so the system can’t be too computationally intensive. Tang and his colleagues appear to have come up with a scalable system that meets these requirements, although he said there’s still room for improvement.

Coming soon to a server near you?

More use cases emerging for graphs could motivate more companies to try out the graph model. And that means more business opportunities. The namesake of the GraphLab workshop, the GraphLab open-source graph project with roots at the University of Washington, spun off a startup just a couple of months ago. Now another project from the university, Grappa, is spinning off a startup, too, said Mark Oskin, a professor there. Grappa aims to ensure that performance stays strong as graphs run in memory across a whole bunch of commodity servers, while at the same time making the most of network bandwidth. GraphLab, for its part, announced the release of version 2.2 of its software, which makes it easier for developers to write machine-learning programs, said its founder and CEO, Carlos Guestrin.
While webscale companies are already reaping the benefits of storing data in graph models and the market shows room for growth, adoption across enterprises might take some years yet. Dr. Theodore Willke of Intel Labs is a believer in the graph — he has worked on the GraphBuilder system for making graphs out of data in Hadoop — but he thinks his contemporaries pushing graph analytics are far ahead of the rest of the world in terms of getting people on board. He, too, is “guilty of being in this rocketship going at, like, warp nine,” he said, with innovators having huge clusters to work on and intense performance needs. “Most of the industry is, like, miles behind you,” Willke said. Engineers at companies are still getting on board with doing MapReduce jobs, he said. Now it’s important to articulate the clear business uses of graphs that demonstrate their value. To help with that, Willke said he intends to focus his efforts on making graphs easy to use and integrate with other computational models.
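The Who to Follow idea described above, mining the connections held in a single server’s memory, can be caricatured as friends-of-friends scoring. The sketch below uses made-up users and is nothing like Cassovary’s real implementation; it only illustrates why a follow graph that fits in memory makes such recommendations cheap.

```python
# Friends-of-friends recommendation over an in-memory follow graph:
# recommend accounts followed by the accounts you already follow,
# ranked by how many of your follows lead there.
from collections import Counter

follows = {
    "alice": {"bob", "carol"},
    "bob":   {"carol", "dave"},
    "carol": {"dave", "erin"},
    "dave":  {"erin"},
}

def who_to_follow(user, graph, k=2):
    already = graph.get(user, set()) | {user}
    counts = Counter(
        candidate
        for friend in graph.get(user, ())
        for candidate in graph.get(friend, ())
        if candidate not in already
    )
    return [name for name, _ in counts.most_common(k)]

print(who_to_follow("alice", follows))
```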

View the entire article here!

 

InfoQ Logo

MLConf was going strong on Friday, April 11 in New York City. This was the first ever MLConf in NYC, and it was met with resounding success, so much so that it quickly sold out. The conference was a full day packed with sessions on machine learning, and a good portion of the talks also covered big data topics and how to apply this machine intelligence at very large scale. This is another good indication that the field of data science is stronger than ever, and that today’s data scientists need a mix of two skills: traditional research skills with a background in mathematics and statistics, and engineering skills to be able to work with the most popular big data frameworks.

Machine Learning and Big Data

Corinna Cortes from Google kicked off the conference with a description of how some of Google’s services are built, with a focus on how they scale to millions of users. Take for example the problem of image browsing, where Google is applying research in image processing to cluster similar images together in its product titled Google Image Swirl. According to Corinna, this is done by computing all the pair-wise distances between related images to form clusters, but at Google scale this quickly becomes impractical. To solve that, Google started representing images by short vectors using the Kernel-PCA algorithm, and making random projections to grow the resulting tree top-down. At query time, all that is left is browsing the tree to return the related images. Another example described by Corinna is Google Flu Trends, where they are essentially looking at correlation between query searches. This is similar to another product called Google Correlate, which allows users to correlate search patterns with arbitrary real-world events. Corinna described using the K-means algorithm to solve this problem by splitting time series into smaller chunks, and representing each chunk with a set of cluster centers.
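The chunk-and-cluster approach Corinna described can be sketched in a few lines: split a series into fixed-size chunks and summarize the chunks with k-means centers. The code below is a toy Lloyd’s-algorithm implementation on invented data, not Google’s pipeline.

```python
# Summarize a time series by clustering its fixed-size chunks with k-means.
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Recompute centers; keep the old center if a cluster emptied out.
        centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

series = [0, 0, 1, 1, 9, 9, 8, 8, 0, 1, 9, 8]   # toy "search volume" series
chunk = 2
chunks = [tuple(series[i:i + chunk]) for i in range(0, len(series), chunk)]
centers = kmeans(chunks, k=2)
print(sorted(centers))   # one "low activity" and one "high activity" center
```

Each chunk is then representable by its nearest center, which is the accuracy-for-space trade common to these large-scale methods.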
Ted Willke from Intel came on stage to stress an important paradigm shift: we are living in a world where context and semantics matter and are being analyzed more and more, for example through the use of RDF. Scaling at web scale remains a challenge, and Ted described using the Titan distributed database to store RDF data, and applying the LDA algorithm, which can be simply expressed using Gremlin. In a very different talk, Yael Elmatad from Tapad described the algorithms involved in the construction of Tapad’s device graph, which links consumer devices together and contains around 2 billion nodes representing more than 100 million households and 250 million individuals. The weighting algorithm for the edges between devices in particular took several tries to get right. Focusing initially on segmentation data (traits associated with various individuals), Yael found that this approach performed poorly, barely 10% better than a random guess, because segment data is by nature filled with randomness and noise, and also because long-lived devices tend to accumulate a lot of segments. Instead of trying to correct the biases, Yael took a different approach using Tapad’s in-house browsing data, considering unique domains, even if that data is harder to come by and much sparser. The results were much more encouraging and performed 40% better than random, which is close to ideal since Tapad considers 50% better than random to be the benchmark. Yael makes an interesting statement about what data should be considered when building models:

The best pieces of data may be scarce and raw because they are often less fraught with hidden biases and unnecessary processing.

Edo Liberty from Yahoo came on stage to describe a type of streaming computational model that is often useful when dealing with large-scale data under limited memory. In the context of email threading for Yahoo Mail, Edo was tasked with coming up with an algorithm that would scale to millions of users with limited resources. Edo used the Misra-Gries algorithm described in these lecture notes, clustering all available emails and computing conditional probabilities to determine how likely two emails are to occur one after another. This method goes back to a common theme of trading accuracy for space. In terms of scalability, there are many databases on the market, like Amazon Redshift, that are built for performance and try to be a generic platform to fit all use cases. According to Shan Shan Huang from LogicBlox, there are times when a database built for a very specific purpose can outperform a generic database like Redshift. LogicBlox implemented their own join algorithm in the form of the leapfrog triejoin, which is a kind of worst-case optimal join, and is available in their platform.
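For readers curious about the Misra-Gries algorithm Edo mentioned, here is a minimal sketch: in one pass, using at most k-1 counters, it finds every item that could occur more than n/k times, a direct example of trading accuracy for space.

```python
# Misra-Gries frequent-items sketch: one pass, at most k-1 counters.
def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Stream element matched no counter and all slots are full:
            # decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = list("aababcabdaae")
candidates = misra_gries(stream, k=3)   # items that may exceed len(stream)/3
print(candidates)
```

The survivors are only candidates; a second pass (or exact counts) is needed to confirm their true frequencies.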

View the entire article here!

 

hireiqlogo200300x103

ATLANTA, GA — (April 8, 2014) — HireIQ Solutions, the innovative leader in predictive analytics and virtual talent acquisition solutions for customer-facing organizations, today announced that Todd Merrill, the company’s chief technology officer, is presenting at the upcoming MLConf. Todd’s presentation will introduce the company’s approach to improving the employee selection process, especially for customer-facing positions, by predicting a job candidate’s likelihood of retention and on-the-job performance. MLConf 2014 brings together many of the leading practitioners of advanced machine learning, big data and predictive analytics platforms, algorithms and applications for a one-day intensive networking session. Delegates will share best practices and hear about the latest innovations in machine learning. The program includes speakers from other leading machine learning and big data practitioners such as Google, Intel, Yahoo!, and Netflix. For further information about this year’s event, please visit: http://mlconf.com/. HireIQ’s innovative use of machine learning and predictive analytics effectively predicts which job candidates are more likely to remain employed and perform well once on the job. The company’s approach uses pre-hire assessment data from virtual interviews, tests and assessments, coupled with observed performance outcomes, to identify the job candidates that are most likely to succeed. “Employee retention and job performance are important for all organizations, but especially for companies that run customer service operations, where attrition is a particularly vexing problem,” said Todd Merrill, HireIQ’s chief technology officer. “Our ability to reliably and accurately predict those job candidates who are likely to be excellent performers has significant implications for these companies in terms of profitability, customer satisfaction and employee engagement.
HireIQ is the only company in the talent acquisition market that has adopted such an approach.” HireIQ transforms the talent acquisition process for customer-facing organizations by linking stakeholder-observed business outcomes, such as operational performance and retention with the results of pre-hire digital interviews. As a result, companies improve their hiring decisions, reduce the critical time-to-fill interval, lower recruiting costs and increase employee retention and performance.

View the entire article here!

 

0xdata

This Friday H2O will be at MLconf (http://mlconf.com) to give a live demo, introduce a customer use case, and talk about the implications of model specification in production. If you don’t get a chance to stop by our booth, or come see our demo, you can find the presentation slides on the MLconf website (they will be posted on Friday, April 11). We’ll be walking through a practical example of different model choices and outcomes using the same predictors and target values. Generalized linear models have the benefit of interpretability and easy scaling of estimated parameters. When data are overly complex, GLM can easily be linked to PCA, or regularization can be applied to simplify the model. On the other hand, our Gradient Boosted Machine handles both classification and regression, takes a non-parametric approach, and quickly models even the most complex of interactions. Interpretability can be a bit harder, but identifying critical components is far easier with variable importance returned with the model output. You can also hear from our client, Collective, on their experiences with real world application in an H2O use case.
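The regularization point above is easy to see on toy data: an L2 penalty shrinks GLM coefficients toward zero, trading a little fit for a simpler, more stable model. The sketch below is plain gradient descent on ridge regression with invented numbers; it illustrates the idea only and is not H2O’s implementation.

```python
# Ridge regression (L2-regularized linear model) by gradient descent.
def ridge_fit(xs, ys, lam, lr=0.01, steps=5000):
    """Fit y ~ w*x + b, penalizing (lam/2) * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n + lam * w
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [0.1, 1.9, 4.2, 5.8, 8.1]          # roughly y = 2x, with noise
w_plain, _ = ridge_fit(xs, ys, lam=0.0)  # ordinary least squares
w_shrunk, _ = ridge_fit(xs, ys, lam=5.0) # heavily regularized
print(round(w_plain, 2), round(w_shrunk, 2))  # the penalty shrinks the slope
```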

View the entire article here!

 

CMSWireLogo

Machine learning and big data processing are ubiquitous in the operations of every modern company. New machine learning algorithms, platforms and applications have emerged in industry and academia. Come and learn about the most fascinating advances in ML from the experts. The head of Google Research NY, Corinna Cortes, will present the largest-scale ML deployment, while Claudia Perlich, a serial KDD Cup champion, will discuss her experience in practical machine learning. We’ll host a series of talks on bleeding-edge ML platforms like H2O, LogicBlox, Cloudera and Ayasdi. Samantha Kleinberg will present her interesting new results on inference from uncertain data in diabetes using continuous sensor data, with a tie-in to Google Glass. Animashree Anandkumar will introduce you to the world of tensors, teaching you how to apply tensors to solving machine learning problems. Join us on Friday, April 11th in New York City for a full-day conference on machine learning. We’ll have presentations from Google, Yahoo, Cloudera, Ayasdi, 0xdata, and many more.

View the entire article here!

 

MasterStreet Logo

MLconf NYC 2014

Join us on Friday, April 11th in New York City for a full-day conference on Machine Learning. We’ll have presentations from Google, Cloudera, Ayasdi, 0xdata, and many more. Follow @MLconf for updates, discounts and free tickets! MLconf is a meeting place for both academia and industry to discuss upcoming challenges of large scale machine learning and solution methods. Presentations from:

  • Corinna Cortes, Head of Research, Google
  • Josh Wills, Director of Data Science, Cloudera
  • Ted Willke, Principal Engineer and GM of the Graph Analytics Operation, Intel Labs
  • Edo Liberty, Research Scientist, Yahoo
  • Justin Basilico, Senior Researcher/Engineer in Recommendation Systems, Netflix
  • Claudia Perlich, Chief Scientist, Dstillery
  • Pek Lum, Chief Data Scientist, Ayasdi
  • Shan Shan Huang, VP Product Management, LogicBlox, Inc.
  • Yael Elmatad, Data Scientist, Tapad
  • Irene Lang, Math Hacker, 0xdata
  • Anqi Fu, Data Scientist, 0xdata
  • Samantha Kleinberg, Computer Science Department, Stevens Institute of Technology
  • Animashree Anandkumar, Electrical Engineering and Computer Science Dept., UC Irvine

View the entire article here!

 

SHARETHROUGH LOGO

This past Friday 11/15, Sharethrough Engineering attended the 2013 MLConf here in San Francisco where Netflix, Twitter, Yelp and others presented on large-scale ML trends and challenges.

Michael R.’s Thoughts

For me a real highlight was the talk “Big Data Lessons in Music” given by Eric Bieschke, Chief Scientist at Pandora. Keeping algorithmics to a minimum, he explored the importance of choosing the right metric: ‘How you judge experiments shapes where you are headed; choose the wrong measuring stick and you wind up in the wrong place.’ Jake Mannix of Twitter discussed content-based approaches to recommendation systems, particularly as an antidote to cold-start problems (e.g. a new e-commerce site that doesn’t yet have ratings). I’m not sure I would want to employ that kind of strategy, since latency would rocket skyward as the item set grows, but he pointed out that Twitter has a cold-start problem with new users (whose graph is small), and that their recsys employs a hybrid of content-based and collaborative filtering-based recommendations. Quoc V. Le, of Google and Stanford, gave an interesting, though slightly schematic, overview of Google’s DistBelief, which performs unsupervised learning across tens of thousands of cores. Quoc described its uses in image recognition and voice search, but I suspect that these applications look pretty tame compared to what Google plans to do with it.

Michael J.’s Thoughts

Netflix

Netflix’s previous business model was all about DVD rentals by mail, so users were much more picky about what they put in their queue, and found a lot of value in awesome recommendations. The cost of a bad recommendation was high, as users would have to wait several days to get the DVD, watch it, return it and rate it. Now Netflix is more about streaming: with 40 million users, 5 billion hours of video were streamed in Q3 of 2013. The bottleneck of delivering DVDs in the mail is gone, and Netflix now gets about 5 million ratings every day. Users now make impulsive decisions about what to watch, and are happy to abandon content if it isn’t to their liking. Because users can watch many different pieces of content with relatively little investment, the “search space” of movies and TV shows can be explored much more quickly. Modern Netflix recommendations are all about grouping content by similarity across many vectors. User behaviour can be a strong indicator of how other people consume media (other people who watch Parks and Recreation also watched The Office), but tagging content with metadata can lead to very interesting categories for recommendations. The Netflix home screen now features many rows of content, grouped by similarity. Genres are generated from pools of tags, leading to collections like “Independent Comedies with a Strong Female Lead”. Netflix is probably the gold standard for recommendations right now, with LinkedIn’s “People You May Know” a close second. Their powerful system provides such good recommendations that only 25% of views are started from a search; the other 75% come from a piece of recommended content.
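The tag-based grouping described here can be sketched with simple set overlap: titles whose tag sets have high Jaccard similarity fall into the same ad-hoc row. Toy titles and tags below, purely illustrative and nothing like Netflix’s real system.

```python
# Group titles into ad-hoc "genre rows" by Jaccard similarity of tag sets.
tags = {
    "Show A": {"comedy", "workplace", "ensemble"},
    "Show B": {"comedy", "workplace", "mockumentary"},
    "Film C": {"indie", "comedy", "strong-female-lead"},
    "Film D": {"indie", "drama", "strong-female-lead"},
}

def jaccard(a, b):
    """Overlap of two tag sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def similar_to(title, threshold=0.3):
    return sorted(
        other for other in tags
        if other != title and jaccard(tags[title], tags[other]) >= threshold
    )

print(similar_to("Show A"))   # workplace comedies cluster together
print(similar_to("Film D"))   # indie strong-female-lead titles cluster together
```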

Yelp Recommendations

In the Yelp iOS app, the “nearby” tab used to be a facade for search. A small project team was broken off, with about three engineers and some front-end developers, to work on the new Nearby tab. Unlike at Netflix, context is very important: a user’s location or the time of day will eliminate 95% – 99% of Yelp’s business database as valid results. In the initial version, the team had very little ML experience. They knew they would have all of Yelp’s users and data on day one, with no gradual rollout! They had a small team, and they were hoping this product would be long-lived, so they had to build for the future. The following principles were distilled:

  • There was a big data-retrieval problem (95%–99% of the data is useless for each request), and database choices were constrained by the domain (fast geo search, time-of-day filtering, etc.)
  • Build for what you have, but plan for expansion
  • Goal is a great product, not a benchmark

The architecture works by fanning out search requests to systems called Experts. Each Expert applies its algorithms and returns zero or more recommendations, with location and time-of-day filtering applying to all Experts. One example is the “Liked by Friends” Expert, which uses social data to find and recommend places your friends like. The system aggregates the results from each Expert and makes a “wise” decision for the user by ranking the Experts’ recommendations. For example, if you’re traveling to a new city and the “Liked by Friends” Expert suggests a coffee shop, you’re very likely to take the recommendation precisely because it’s unexpected. Learnings:

  • Solve your own problem (e.g. Netflix ML won’t help at Yelp)
  • Build for what you have, plan for the future
  • Internal iterations pre-launch are very useful (dogfooding).
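The fan-out/aggregate pattern described above can be sketched as follows. The Expert names, scores and the max-score aggregation rule are assumptions made for illustration; Yelp’s real Experts also apply the location and time-of-day filtering mentioned earlier:

```python
class Expert:
    """Base Expert: given a user and request context, suggest businesses."""
    name = "base"
    def recommend(self, user, context):
        return []  # zero or more (business, score) suggestions

class LikedByFriendsExpert(Expert):
    name = "liked_by_friends"
    def recommend(self, user, context):
        # Weight friend-liked places highly: unexpected but trusted.
        friends_likes = context.get("friend_likes", {}).get(user, [])
        return [(biz, 0.9) for biz in friends_likes]

class PopularNearbyExpert(Expert):
    name = "popular_nearby"
    def recommend(self, user, context):
        return [(biz, 0.5) for biz in context.get("popular", [])]

def nearby(user, context, experts):
    """Fan the request out to every Expert, then aggregate and rank."""
    scores = {}
    for expert in experts:
        for biz, score in expert.recommend(user, context):
            # Simple aggregation rule: keep each business's best score.
            scores[biz] = max(scores.get(biz, 0.0), score)
    return sorted(scores, key=scores.get, reverse=True)

context = {
    "friend_likes": {"alice": ["Ritual Coffee"]},
    "popular": ["Tartine Bakery", "Ritual Coffee"],
}
print(nearby("alice", context, [LikedByFriendsExpert(), PopularNearbyExpert()]))
# → ['Ritual Coffee', 'Tartine Bakery']
```

The appeal of this design is that each Expert is simple enough to test and monitor on its own, and new Experts can be added without touching the aggregation layer.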

Ryan’s Thoughts

Most of the ML presented was applied to recommendation systems in various forms. Three themes emerged:

  1. First was the increased interest in so-called “deep learning”: neural networks with more than 3 hidden layers. This was by far the most popular topic in the more academic track of talks. In particular, there was a focus on how to implement the distributed computing necessary to train these huge neural networks on large data sets.
  2. The second major theme was the “productionalization” of recommendation systems. This focused on picking the right metrics to optimize for and choosing algorithms that are understandable in production. Scott Triglia from Yelp talked about how they choose simpler algorithms that can be more easily composed and monitored, which allows complexity to be introduced iteratively without creating a black box. Both Pandora and Netflix discussed the importance of finding the right error function to optimize recommendations towards. A particularly impactful comment came from Xavier Amatriain of Netflix: “Social popularity is the baseline for all recommendation systems, and if you can’t beat ‘most popular’ then you need to go back to the drawing board.” He also cited evidence that pure popularity was not a great recommendation strategy for Netflix: they saw a 20x improvement when they moved to personalized recommendations.
  3. The third and final theme was the importance of alternative infrastructure for data processing. This was not discussed directly, other than in the Spark talk, but it was an undertone throughout. Most companies talked about building ETL pipelines using Hadoop and then building and training models in some other system with more natural iterative programming abstractions than Hadoop.
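Amatriain’s “most popular” baseline from the second theme is easy to make concrete. The view log and titles below are invented for illustration; the point is that any personalized recommender should have to beat this trivial strategy before it ships:

```python
from collections import Counter

# Toy view log: (user, title) pairs. Data is made up for illustration.
views = [
    ("u1", "House of Cards"), ("u2", "House of Cards"), ("u3", "House of Cards"),
    ("u1", "Arrested Development"), ("u2", "Orange Is the New Black"),
]

def most_popular_baseline(view_log, seen_by_user, n=2):
    """Recommend the globally most-viewed titles the user hasn't already seen."""
    counts = Counter(title for _, title in view_log)
    return [t for t, _ in counts.most_common() if t not in seen_by_user][:n]

seen = {"Orange Is the New Black"}
print(most_popular_baseline(views, seen))
# → ['House of Cards', 'Arrested Development']
```

A personalized system is only worth its complexity if its error metric measurably beats this baseline, which is exactly the bar Amatriain set.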

View the entire article here!


FunInTheMoment Logo

MLconf 2013 Join us on Friday, November 15th in San Francisco for a full-day conference on machine learning. We’ll have presentations from Intel, Netflix, Twitter, Pandora, Google, GraphLab, Yelp, and many more. Follow @MLconf for updates, discounts and free tickets! MLconf is a meeting place for both academia and industry to discuss upcoming challenges of large-scale machine learning and solution methods. The conference will include presentations on Deep Learning, Recommender Systems, Collaborative Filtering and Matrix Factorization.

View the entire article here!


CrowdFlower Logo

The Machine Learning Conference (MLconf) is this Friday, Nov. 15, in San Francisco. MLconf is a meeting place for both academia and industry to discuss upcoming challenges of large-scale machine learning and solution methods. This year will highlight the exploding machine learning field and those at the forefront of its design and application. The conference will include topics spanning the worlds of entertainment, education, technology, communication and finance. CrowdFlower’s CEO, Lukas Biewald, is kicking off the day with a talk called “Organizing Big Data with the Crowd,” which will discuss how CrowdFlower is effectively leveraging the human intelligence of the crowd to solve problems related to collecting, categorizing and labeling vast amounts of data, resulting in better training of machine learning models and improved system performance. Other presenters include machine learning experts from Intel, Netflix, Pandora, Google, Twitter, GraphLab, and Stanford, to name a few. Learn more about MLconf and register using this link to receive a special discount as a friend of CrowdFlower.

View the entire article here!

