Building a Scalable Follower Feed with Firestore

I have written several articles over the years about this subject, changed my thought process, seen other people's ideas, and changed my ideas again. Here I will cover everything you need to know in one article.

What is a Follower Feed? #

A follower feed is NOT something trivial in noSQL: it shows a list of the latest posts by the users you follow. While a regular feed can show the latest posts by all users, a follower feed shows only the posts by the users you follow.

TL;DR #

While there are many techniques to display a follower feed in Firestore, they all have their limits. I have thought through every possibility, and it always comes back to aggregating your data with a Cloud Functions Trigger. This is called the fan-out method. Once you reach a certain number of posts and users, you can have Cloud Functions hand the work off to other servers that are not serverless in order to beat the Cloud Functions time limit. However, you are unlikely to ever reach this limit. You should also use this method for following tags, or for filtering any other data that would require a join in other data models.

Schema / Data Model #

Whether you're using a GraphDB, noSQL, or SQL, the data model is still generally the same. You're going to have posts, users, and follows.

	Posts
	  id
	  title
	  authorId
	  ...
	Users
	  id
	  username
	  ...
	Followers
	  follower_id
	  following_id

This translates to noSQL like so:

	posts/{postId}
	  id
	  title
	  authorId
	  createdAt
	  ...
	users/{userId}
	  username
	  ...
	followers/{followerId}
	  following_id
	  ...

Although, as you will see, this will not work from a querying perspective. You could also easily use subcollections instead of root collections for any of the collections, but the result is the same.

Queries #

Here are the queries that show what we're trying to achieve:

SQL 1 #

	SELECT * FROM Posts
	WHERE authorId IN
	  (SELECT following_id FROM Followers WHERE follower_id = $UID)
	ORDER BY createdAt DESC

SQL 2 #

	SELECT p.* FROM Posts p
	JOIN Followers f ON f.following_id = p.authorId
	WHERE f.follower_id = $UID
	ORDER BY p.createdAt DESC

GraphQL #

	query {
	  queryPost(
	    where: { authorId: { follower: { id: $UID } } },
	    order: { desc: createdAt }
	  ) {
	    id
	    title
	    createdAt
	    ...
	  }
	}

So we really just end up with a many-to-many relationship like so:

	Posts <- Followers -> Users

If we want to translate that to Firestore noSQL, we get something like this:

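	// One document per follow edge, with `follower` and `following` fields (assumed).
	const followersSnap = await db.collection('followers')
	  .where('follower', '==', $UID)
	  .get();

	const following = followersSnap.docs.map((doc) => doc.data().following);

	// 'in' takes a non-empty array of at most 30 values.
	const postsSnap = await db.collection('posts')
	  .where('authorId', 'in', following)
	  .get();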

But the 'in' operator accepts at most 30 values, so we are limited to following 30 people, and we are really doing two queries on the frontend instead of a backend join.
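One way to stretch past the 30-value cap is to split the followed IDs into chunks, run one query per chunk, and merge the results on the client. Below is a minimal sketch of the two pure helpers that would involve. The names `chunk` and `mergeByNewest` are hypothetical, and posts are assumed to carry a numeric `createdAt` (epoch milliseconds); the fake arrays stand in for real query results.

```javascript
// Hypothetical helpers; neither is part of the Firebase SDK.

// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size = 30) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Merge the per-chunk results into one list, newest first.
// Assumes each post has a numeric `createdAt` (e.g. epoch milliseconds).
function mergeByNewest(postLists, limit = 20) {
  return postLists
    .flat()
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
}

// Fake data standing in for 65 followed users and two query results.
const ids = Array.from({ length: 65 }, (_, i) => `user${i}`);
const batches = chunk(ids);
const feed = mergeByNewest([
  [{ id: 'a', createdAt: 3 }, { id: 'b', createdAt: 1 }],
  [{ id: 'c', createdAt: 2 }],
]);
console.log(batches.map((b) => b.length), feed.map((p) => p.id));
```

Each batch would feed one where('authorId', 'in', batch) query, so the cost grows by one query for every 30 followed users, which is why this approach stops scaling quickly.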

Tag Feed #

We also run into the problem of other desired feeds, like following tags. We may want one feed with the latest posts for the tags we follow, or a feed with the latest posts from the users we follow, or BOTH. Then you get into weighted queries. If a post has a tag and a user we follow, should it be more important? As we have seen with other social media, we may want to artificially promote or demote certain types of posts, or we may want to use AI to create more addictive feeds. These advanced types are beyond the scope of this post, and not really a fit for aggregations.

Imperfect Versions #

So, let's see what we can do if we are not worried about scaling to millions of users.

Version 1 - Frontend Nightmare #

Do all the query combining, indexing, etc. on the frontend. This will cost you a lot and be slow. No thanks.

A sister version is to have a following array in users/{userID}. You can then grab just one document with all the users you follow, then grab their posts on the frontend. Better, but still over-reading.

Version 2 - Build Your Own Feed #

This is one of my ideas. Basically, when each user logs in, they update their feed on the spot. They will save the last updated date, and populate their feed in the background. This makes sense to me in certain circumstances, but is still not the best.
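As a rough illustration of that login-time refresh, here is a sketch of the merge step only. `refreshFeed` is a hypothetical name, the candidate posts are assumed to have been fetched already, and all timestamps are assumed to be epoch milliseconds.

```javascript
// Hypothetical helper: decide what to add to a user's stored feed at login.
// `lastUpdatedAt`, `now`, and `createdAt` are assumed to be epoch milliseconds.
function refreshFeed(feed, candidatePosts, lastUpdatedAt, now) {
  const known = new Set(feed.map((p) => p.id));
  // Keep only posts created since the last refresh that are not already stored.
  const fresh = candidatePosts.filter(
    (p) => p.createdAt > lastUpdatedAt && !known.has(p.id)
  );
  const merged = [...fresh, ...feed].sort((a, b) => b.createdAt - a.createdAt);
  return { feed: merged, lastUpdatedAt: now };
}

const refreshed = refreshFeed(
  [{ id: 'p1', createdAt: 100 }],
  [
    { id: 'p0', createdAt: 50 },
    { id: 'p1', createdAt: 100 },
    { id: 'p2', createdAt: 150 },
  ],
  100,
  200
);
console.log(refreshed.feed.map((p) => p.id), refreshed.lastUpdatedAt);
```

The saved `lastUpdatedAt` is what keeps each login cheap: only posts newer than the previous refresh need to be read.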

Version 3 - Fireship's Method #

This method was quite complex, but is definitely worth understanding.

Basically he has this data model:

	followers/{followerID}
	  recentPosts: [
	    ...5 recent posts here
	  ],
	  users: [
	    user_following_ids
	  ]
	posts/{postId}
	  ...post content here
	users/{userId}
	  ...user content here

With this query:

	const followedUsers = await db.collection('followers')
	  .where('users', 'array-contains', followerID)
	  .orderBy('lastPost', 'desc')
	  .limit(10)
	  .get();

You create a posts aggregation Firestore Trigger Function, to aggregate the latest 5 posts in recentPosts.

This works great, but then you have a limit on the possible followers you can have, due to using an array, and a limit on the number of posts you have on the frontend. You still need to sort all the latest posts. This is a great idea, but a hack nonetheless.
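To make the two halves of this approach concrete, here is a sketch of the array handling on each side. It is not Fireship's actual code: `pushRecentPost` and `latestFromFollowedDocs` are hypothetical names, the document shapes are plain objects standing in for Firestore data, and `createdAt` is assumed to be a number.

```javascript
// Hypothetical helpers; names and field shapes are assumptions for illustration.
// `createdAt` is assumed to be a number (epoch milliseconds).

// Trigger side: keep only the newest `max` posts in the recentPosts array.
function pushRecentPost(recentPosts, post, max = 5) {
  return [post, ...recentPosts.filter((p) => p.id !== post.id)]
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, max);
}

// Client side: flatten recentPosts from every matched document and re-sort.
function latestFromFollowedDocs(docs, limit = 10) {
  return docs
    .flatMap((doc) => doc.recentPosts ?? [])
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
}

const existing = [5, 4, 3, 2, 1].map((t) => ({ id: `p${t}`, createdAt: t }));
const updated = pushRecentPost(existing, { id: 'p6', createdAt: 6 });
const merged = latestFromFollowedDocs([
  { recentPosts: updated },
  { recentPosts: [{ id: 'q1', createdAt: 5.5 }] },
]);
console.log(updated.map((p) => p.id), merged.map((p) => p.id));
```

The client-side sort is the part that does not go away: every matched document contributes up to five posts, and they only become a feed once they are merged and ordered.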

It is interesting to note here that he believes mass duplication is unscalable if you do the math, due to the write cost for someone with millions of followers. He is right and wrong here.

Version 4 - Albert's Version #

This is the best hacked version I found on Stack Overflow. It basically says to store the posts like so:

	users/{userId}
	  recentPosts: [
	    ...1000 recent posts
	  ],
	  recentPostsLastUpdatedAt: Date
	posts/{postId}
	  ...post content here
	following/{followerId}
	  following: [
	    ...users following
	  ]

You aggregate up to 1000 posts on the user document in this version. Then it tells you to query the users you follow in batches of 10 (the 'in' operator now accepts up to 30 values):

	query(usersRef,
	  where('userID', 'in', [FOLLOWEE_ID_1, FOLLOWEE_ID_2, …]),
	  where('recentPostsLastUpdatedAt', '>', LAST_QUERIED_AT)
	)

Once you get all users, you have all user posts, which is potentially 1 million posts for the price of 1000 reads. I like the thinking here, but still not for me. Again, too much frontend sorting, and over-complicated when you're just starting out.
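The arithmetic behind that claim can be written out as a small helper. This is only a back-of-the-envelope sketch: `albertReadEstimate` is a hypothetical name, and it assumes one document read per followed user and 30 IDs per 'in' query.

```javascript
// Hypothetical estimate of the upper bounds for this approach.
// Assumes one user document read per followed user and `batchSize` IDs per query.
function albertReadEstimate(followedCount, postsPerUser = 1000, batchSize = 30) {
  return {
    queries: Math.ceil(followedCount / batchSize),
    maxDocumentReads: followedCount,
    maxPostsAvailable: followedCount * postsPerUser,
  };
}

const estimate = albertReadEstimate(1000);
console.log(estimate);
```

Keep in mind that a Firestore document is capped at 1 MiB, so embedding 1000 posts leaves roughly 1 KB per post entry.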

Version 5 - My Crazy Aggregation Version #

So, I came up with a theoretical idea for a scalable version. It uses arrays to save money, but ultimately made no sense. Imagine using a 3 step aggregation to ultimately get a feed collection like this:

	feed/{postID}
	  createdAt
	  followers: [
	    ...first 1000 followers
	  ]

This gave you a neat query like so:

	db.collection('feed')
	  .where('followers', 'array-contains', userId)
	  .orderBy('createdAt', 'desc');

But ultimately, it was too unrealistic and unreliable. While arrays save money, they are limited and require more splitting.

Version 6 - Mass Aggregation Fan-out #

The biggest problem with mass aggregation is the limits of Firebase Functions; it could time out. However, it can be solved.

Imagine creating an onWrite function for the posts collection. This could trigger a callable function, say populateFollowerFeed(). This function could look like this:

	populateFollowerFeed({
	  data: change.after.data(),
	  startId: '0x12slsl2sls',
	  num: 20
	});

Yes, you can call a function inside a function. This function would go through a follower collection (either a subcollection, or a query within a root collection) to get the IDs of all users who follow the author. It could then add the created or updated post to each of those users' feed collections.

This function would call itself again with the next startId, until there are no more follower IDs. This prevents function timeouts. You should probably have it delete aggregated posts as well.
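The paging logic described here can be sketched without any Firebase dependency. In the block below, `getFollowerPage` and `writeToFeed` are hypothetical stand-ins for the Firestore read and write, the follower list is an in-memory object, and the recursive call is local, whereas in a real deployment it would be a fresh function invocation.

```javascript
// Hypothetical sketch: fan a post out to followers in pages of `num`.
// `deps.getFollowerPage` and `deps.writeToFeed` stand in for Firestore calls.
async function populateFollowerFeed({ data, startId = null, num = 20 }, deps) {
  const followerIds = await deps.getFollowerPage(data.authorId, startId, num);
  for (const followerId of followerIds) {
    // Keyed by post ID, so re-running a page overwrites instead of duplicating.
    await deps.writeToFeed(followerId, data.id, data);
  }
  if (followerIds.length < num) return; // last page reached
  // In production this would be a new function invocation, not a local call.
  await populateFollowerFeed(
    { data, startId: followerIds[followerIds.length - 1], num },
    deps
  );
}

// In-memory stand-ins for the follower collection and the per-user feeds.
const followers = { alice: ['u1', 'u2', 'u3', 'u4', 'u5'] };
const feeds = new Map();
const deps = {
  // Return up to `num` follower IDs that sort after `startId`.
  getFollowerPage: async (authorId, startId, num) =>
    (followers[authorId] ?? [])
      .filter((id) => startId === null || id > startId)
      .slice(0, num),
  writeToFeed: async (followerId, postId, post) => {
    if (!feeds.has(followerId)) feeds.set(followerId, new Map());
    feeds.get(followerId).set(postId, post);
  },
};

const done = populateFollowerFeed(
  { data: { id: 'post1', authorId: 'alice', title: 'Hello' }, num: 2 },
  deps
);
done.then(() => console.log([...feeds.keys()]));
```

Writing each feed entry under the post's own ID is what keeps a retried page harmless: the same follower gets the same document again rather than a duplicate.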

The beauty of this is that you could have another callable function for populateTagsFeed(). This could be important if you want to mix and match your posts by followed tags as well.

Yes, this gets expensive for writes. However, it is simple, idempotent, and cheap for small to medium-sized databases. I disagree that this is infeasible, as Firestore is specifically built for reads, not writes. All noSQL databases are designed around this trade-off. If you have 1,000,000 users, the costs should be minor compared to your real needs.

The Firebase Way #

Luckily, with the Firebase platform, you don't need all that. You could just use your Cloud Functions Trigger to offload the batching to another Google Cloud service like App Engine, Cloud Run, or Compute Engine, which have longer time limits or none at all.

However, in reality, a normal fan-out, aka just aggregating your data, is not going to surpass the 9 minute limit unless your database is extremely large! Don't worry about it. If you get that many hits, you should have the money to figure it out. This is the recommended way.

Ideas from Twitter #

Twitter does this fan-out method to Redis (see Design twitter timeline). They actually pull the latest posts manually from users with a large number of followers, instead of doing a fan-out for everyone, creating a hybrid method. This could be 10 million or 100 million followers, as it is still pretty fast.
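A hybrid like this could look roughly as follows on the application side. This is a sketch of the idea only, not Twitter's implementation: `FAN_OUT_LIMIT`, `shouldFanOut`, and `buildHybridFeed` are hypothetical, and the threshold value is an arbitrary placeholder.

```javascript
// Hypothetical threshold; the real cutoff would depend on your own costs.
const FAN_OUT_LIMIT = 10000;

// Write side: only fan out posts from authors below the threshold.
function shouldFanOut(author) {
  return author.followerCount <= FAN_OUT_LIMIT;
}

// Read side: merge the precomputed feed with posts pulled from large accounts.
// Assumes each post has a unique `id` and a numeric `createdAt`.
function buildHybridFeed(precomputedFeed, postsFromLargeAccounts, limit = 20) {
  const seen = new Set();
  return [...precomputedFeed, ...postsFromLargeAccounts]
    .filter((p) => {
      if (seen.has(p.id)) return false;
      seen.add(p.id);
      return true;
    })
    .sort((a, b) => b.createdAt - a.createdAt)
    .slice(0, limit);
}

const hybrid = buildHybridFeed(
  [{ id: 'small1', createdAt: 10 }, { id: 'small2', createdAt: 30 }],
  [{ id: 'big1', createdAt: 20 }, { id: 'small2', createdAt: 30 }]
);
console.log(hybrid.map((p) => p.id));
```

The trade is one extra read-time query per large account you follow in exchange for skipping millions of writes each time that account posts.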

Other Databases #

Mass Aggregation may be the Firebase way. However, I suspect, considering they recommend Algolia for searching, that the Firebase team would recommend using an external database for your feeds.

However, keep in mind that Algolia and other noSQL databases made for searching cannot do the joins required for a simple follower feed.

One option may be to use BigQuery with a Firebase Trigger. You could pretty much use any SQL database alongside Firestore, but that may be missing the point of Firestore.

Unless you have millions of users, Firestore should work just fine for your use case with basic aggregation techniques.
