by Leon Rosenshein

A Scrolling We Will Go

Many of our datasets are big. Thousands or millions of rows. That's fine. We can search them, we can order them, we can get a subset. And getting a subset is good because UIs like small datasets so they can be responsive. Quick scrolling. Short data query times. And that leads to pagination, so users can scroll through the data without ever having to have the entire set available.

The simplest way to paginate is to use the equivalent of SQL's OFFSET and LIMIT. To get any particular page set OFFSET to pageSize * (pageNumber - 1) and LIMIT to pageSize and go. Simple. Easy to implement on the back end, Easy to add to a URL as a query parameter or a gRPC request. And for small datasets that are mostly read-only it works fine. But as you scale up and the write to  read ratio gets higher, not so much.

The first problem is simple scan time. If you want pages 19,000 and 19,001 of a 1,000,000 element resultset (assuming page size of 50) then each time you make the query you need to build the resultset, skip the first 950,000 results, then return something. That can take a bunch of time, and most of the time is spent dealing with information you're just going to drop on the floor since it's not part of the returned rows.

The second is the cost of the query. The more complicated your query, the longer it takes. If you're joining multiple tables the time can go exponential quickly. Exponentials and a large dataset are a bad combination if you're trying to be fast.

The third problem comes from changes to the datasource. If you've got 200 entries in your table and you want to get page 4 then that would be entries 151-200. Simple. But what happens if the first 10 entries get deleted between the time you ask for page 3 and page 4? Now you get entries 151-190, which isn't a problem, but, you never returned the original elements 151-160, since now they're elements 141-150. And you will have the inverse problem if things get added before the current page. Instead of missing entries you'll get the same entries at the beginning on the next page.

So what's a poor developer to do? First and foremost, indices are your friends. If you've got a natural key in your data, or you can create one from multiple columns, make an index and use it. Then you can use that for your pagination. Instead of using the OFFSET of the last row, use the last key value. Then you're doing a lookup on the index (much faster) and you know you're starting with the last row, regardless of any changes to data.

Another option, if it fits your use case, is to build a temporary table with the results of your query, give it a simple key, like row number, and index on it. Then just return pages from the temporary table. This one is a little more complicated since you need to somehow associate the query with the temp table and manage the table's lifetime. If your query is complicated/takes a long time, results in a big table, and can stand to be a little out of date, this can turn a slow UI into one that's lightning fast.