This paper describes Haystack, an object storage system optimized for Facebook's Photos application. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data. U...
Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook, both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets.
One of the most difficult tasks in rapidly porting speech applications to new languages is the initial collection of sufficient speech corpora.
Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree.
Geography and social relationships are inextricably intertwined; the people we interact with on a daily basis almost always live near us. As people spend more time online, data regarding these two dimensions -- geography and social relationships -- are becoming increasingly precise, allowing us to build reliable models to describe their interaction. These models have important implications in the design of location-based services, security intrusion detection, and social media supporting local communities.
We propose an approach to determine the ethnic breakdown of a population based solely on people’s names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives.
Sharing a MapReduce cluster between users is attractive because it enables statistical multiplexing (lowering costs) and allows users to share a common large data set. However, we find that traditiona...
I analyze the use of emotion words for approximately 100 million Facebook users since September of 2007. “Gross national happiness” is operationalized as a standardized difference between the use of p...
Previous research has shown a relationship between use of social networking sites and feelings of social capital. However, most studies have relied on self-reports by college students. The goals of the current study are to (1) validate the common self-report scale using empirical data from Facebook, (2) test whether previous findings generalize to older and international populations, and (3) delve into the specific activities linked to feelings of social capital and loneliness.