Applying Exponential Family Embeddings in Natural Language Processing to Analyze Text

Abstract

Many data scientists are familiar with word embedding models such as word2vec, which capture the semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either significant amounts of data or tuning through transfer learning to the domain-specific vocabulary found in most commercial applications.

In this talk, Maryam will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. She will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model and that contain more general data types (including categorical, count, and continuous data). Maryam will discuss how we implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last three years. In presenting the results, Maryam will focus on how tech and data science skill sets have developed, grown, and cross-pollinated other types of jobs over time.

Key takeaways: (1) Lessons learnt from implementing different word embedding methods (from pretrained to custom); (2) How to map trends from a combination of natural language and structured data; (3) How data science skills have varied across industries, functions, and over time.

Date
Location
New York, NY