Quantifying Popularity in Real-Time for High-Volume Websites, Part 1

June 4, 2015 11:02 pm by P.I.E. Staff

Technology

This is the first half of a two-part blog post series. Part one covers the theory, part two is an implementation.

Because there is no accepted industry standard algorithm for determining popularity, content publishers can afford to get creative in their assessments. Sometimes, however, these algorithms can be trivially exploitable by spammers to deliver low-quality content to high-traffic areas of a website (e.g. the front page of the website).

What follows is Paragon Initiative Enterprise's user-driven popularity algorithm that is resilient against fraudulent voting. There is no patent for this algorithm; instead, we release it to the public domain. We hope that, after being refined and studied, it can be put to use for the public good.

What is Popularity Anyway?

Although your definition might differ, when we're discussing popularity we mean two things:

Popularity means it is well-liked. This means that on any given scale (0 to 5 stars, 0 to 10 out of 10, 0 to 100 points, etc.), popular items are ranked higher.
Popularity also means recency. Something that was well-liked 10 years ago might not be relevant compared to something scoring high this very minute.

In practice, this means any attempt to rank popularity must have two qualities:

The algorithm must, within reason, allow a community to judge what content is better than other content. From a security perspective, this means that an army of fake accounts should not be able to easily tilt the scale of apparent public opinion.
The algorithm must prioritize the new over the old.

Let's begin by designing an abuse-resistant algorithm for ranking content by a community's perception of its quality, then let's make one adjustment to fairly prioritize the latest and greatest over the tried and true.

To Judge Popularity, First Assess Quality

We'll start with an algorithm used by IMDB for their Top 250 movies, and it looks like this:

$$W = \dfrac{Rv + Cm}{v+m}$$

Where:

$R$ = the average score for a particular item
$v$ = the number of users who voted for a particular item
$C$ = the average of all the votes in the database
$m$ = the minimum number of votes to qualify
$W$ = the weighted rating (what we're solving for)

This has an interesting property: The Weighted rating ($W$) for a particular item is dragged closer to mediocrity ($C$) until it achieves a lot of votes that push it further away from the average.

In mathematics terms:

As $v$ approaches infinity, $W$ approaches $R$.
As $v$ approaches zero, $W$ approaches $C$.

The IMDB algorithm is quite powerful: In IMDB's case, it prevents new movies with 30,000 votes exclusively for 10 out of 10 from scoring higher than a movie with over 500,000 votes (most which are 10/10).

However, in absence of outside mitigations or manual intervention, this algorithm invites the possibility of automated fraudulent voting. All votes are treated equally, whether or not they are legitimate (although in IMDB's case, they only count active users' votes).

What if, instead, we allowed the community to self-select the users whose votes should count more?

The Karmatic Quality Formula

Popular news websites such as Hacker News and Reddit employ a karma system, which in simple terms tallies all of the upvotes and downvotes a user has received from their peers.

Since the karma for any given user is known, and the average karma for all active users is knowable, we can therefore weigh each user's votes by a simple ratio of the two:

$$ r_i = \dfrac{k_i}{\bar k} $$

Now that we have a weight for each user's votes, let's modify the IMDB Algorithm above to include karma ratios instead of a simple average.

$$ K = \frac{\sum_0^{v-1}{{s_i}{r_i}}}{\sum_0^{v-1}{r_i}} $$

$$ W = \dfrac{Kv + Cm}{v+m} $$

$s_i$ = a particular score from a particular vote
$r_i$ = the user's karma ratio
$K$ = weighted average
$v$ = the number of users who voted for a particular item
$C$ = the average of all the votes in the database
$m$ = the minimum number of votes to qualify
$W$ = the weighted rating (what we're solving for)

Quick Example

Let's say we have 10,000 active users with an average karma of 100. An attacker, who wants to push something terrible to the front page (e.g. to profit from ad impressions), controls 1,000 fake accounts (karma == 1) and has them all vote 10/10 on their spam submission.

50 legitimate average users (karma == 100) rate the spam submission 0/10.

What's the result of $K$? About 0.197 out of 10.

What is the result of $W$ if $ m = 100 $ and $ C = 6 $? About 5.178 out of 10.

Despite controlling 10% of the population, a handful of high-karma users (as decided by their peers) can effectively demote spam submissions just by giving it a low score.

Combined with CAPTCHAs and other spam-fighting solutions to make automated account registration more difficult, we can greatly reduce the potential impact of a successful spam campaign. But we still haven't quantified popularity.

An Algorithm for Popularity

The Karmatic Quality Formula can provide a reasonable estimate of the quality of some piece of content, but we're interested in what's good right this very second.

Our solution is straightforward: $K_p$ is similar to our calculation for $K$ above, except we also multiply each $ s_i k_i $ by one more term to add an exponential decay:

$$ K_p = \frac{\sum_0^{v-1}{{s_i}{r_i}{e^{D(t_i - t_{now})}}}}{\sum_0^{v-1}{r_i}} $$

$s_i$ = a particular score from a particular vote
$r_i$ = the user's karma ratio
$D$ = the magnitude of the exponential decay (a constant)
$t_{now}$ = the current moment in time (e.g. UNIX timestamp)
$t_i$ = the moment in time a particular vote was cast
$K_p$ = the karmatic rating with a decay, for the purpose of calculating popularity

Since $t_i$ will always be less than $t_{now}$, the result of $e^{D({t_i}-{t_now})}$ will always be less than or equal to 1.0.

Finally we can determine the popularity of a particular article, $P$:

$$ P = \dfrac{{K_p}{v} + Cm}{v+m} $$

$K_p$ = karma-weighted average, decayed for popularity
$v$ = the number of users who voted for a particular item
$C$ = the average of all the votes in the database
$m$ = the minimum number of votes to qualify
$P$ = the popularity score

Now we have a concise mathematical definition for popularity. In part two, we will implement this algorithm in PostgreSQL.

Permalink

License: Creative Commons Attribution-ShareAlike 4.0 International CC-BY-SA 4.0 Intl
View source (Markdown)

Paragon Initiative Enterprises Blog

The latest information from the team that develops cryptographically secure PHP software.

Wish Security Was Easier?