1,035
edits
No edit summary |
No edit summary |
||
(7 intermediate revisions by the same user not shown) | |||
Line 5: | Line 5: | ||
|Contributors=Jonny | |Contributors=Jonny | ||
|Has Git Repository=https://github.com/NeuromatchAcademy/mastodon | |Has Git Repository=https://github.com/NeuromatchAcademy/mastodon | ||
|Completion Status= | |Completion Status=Completed | ||
|Active Status=Inactive | |Active Status=Inactive | ||
}} | }} | ||
* Pull Request: [[Has Pull Request::https://github.com/NeuromatchAcademy/mastodon/pull/36]] | |||
* Bugfix - respect local vs remote scope: [[Has Pull Request::https://github.com/NeuromatchAcademy/mastodon/pull/38]] | |||
== Problem == | == Problem == | ||
Line 75: | Line 78: | ||
true | true | ||
end | end | ||
</syntaxhighlight> | |||
== Options == | |||
* Model - Make some tricky scope that can filter repeated boosts in the existing public_feed model | |||
* Controller - Filter repeated boosts from the cached feed when serving the public_feed endpoint | |||
* Frontend - Filter repeated boosts in the web UI (but wouldn't work on apps, which is bad) | |||
== Implementation == | |||
=== Simpler Implementation (good) === | |||
Instead of doing all that garbage below, we can just restate the problem as "we want posts that aren't boosts, or posts that are boosts with the maximum ID for all boosts of a given post" | |||
Like this: | |||
<syntaxhighlight lang="ruby"> | |||
def max_boost_id_scope | |||
Status.where(<<~SQL.squish) | |||
"statuses"."id" = ( | |||
SELECT MAX(id) | |||
FROM "statuses" "s2" | |||
WHERE "s2"."reblog_of_id" = "statuses"."reblog_of_id" | |||
#{'AND ("s2"."local" = true OR "s2"."uri" IS NULL)' if local_only?} | |||
#{'AND "s2"."local" = false AND "s2"."uri" IS NOT NULL' if remote_only?} | |||
) | |||
SQL | |||
end | |||
def without_duplicate_reblogs | |||
Status.where(statuses: { reblog_of_id: nil }) | |||
.or(max_boost_id_scope) | |||
end | |||
</syntaxhighlight> | |||
==== Caveats ==== | |||
Note that we are using string interpolations of the <code>Status.local</code> and <code>Status.remote</code> scopes because there isn't a way, as far as [[Jonny]] can tell, to change the table alias of a scope. This shouldn't be that big of an issue (if, eg. the definition of those scopes changes) because the tests should catch most cases. | |||
==== Example Query ==== | |||
<syntaxhighlight lang="sql"> | |||
SELECT "statuses"."id", | |||
"statuses"."updated_at" | |||
FROM "statuses" | |||
INNER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" | |||
WHERE "statuses"."visibility" = $1 | |||
AND "accounts"."suspended_at" IS NULL | |||
AND "accounts"."silenced_at" IS NULL | |||
AND (statuses.reply = FALSE | |||
OR statuses.in_reply_to_account_id = statuses.account_id) | |||
AND ("statuses"."reblog_of_id" IS NULL | |||
OR "statuses"."id" = | |||
(SELECT MAX(id) | |||
FROM statuses s2 | |||
WHERE s2.reblog_of_id = statuses.reblog_of_id | |||
AND ("s2"."local" = true OR "s2"."uri" IS NULL) | |||
)) | |||
AND ("statuses"."local" = $2 | |||
OR "statuses"."uri" IS NULL) | |||
AND "statuses"."deleted_at" IS NULL | |||
AND 1=1 | |||
AND "statuses"."id" < 111813463418866657 | |||
ORDER BY "statuses"."id" DESC LIMIT $3 [["visibility", 0], ["local", true], ["LIMIT", 20]] | |||
</syntaxhighlight> | |||
=== Original Implementation (bad) === | |||
Actually pretty damn simple. Add an additional scope in <code>public_feed.rb</code> | |||
The core of the scope is this: | |||
<syntaxhighlight lang="ruby"> | |||
def without_duplicate_reblogs(limit, max_id, since_id, min_id) | |||
inner_query = Status.select('DISTINCT ON (reblog_of_id) statuses.id') | |||
.reorder(reblog_of_id: :desc, id: :desc) | |||
Status.where(statuses: { reblog_of_id: nil }) | |||
.or(Status.where(id: inner_query)) | |||
</syntaxhighlight> | |||
which selects either | |||
* Statuses that aren't a boost (<code>reblog_of_id == nil</code>) | |||
* Statuses that are a boost | |||
** sorted by the boosted status ID and the ID of the boost itself, | |||
** using postgres' [DISTINCT ON https://www.postgresql.org/docs/current/sql-select.html] clause to select only the first matching row | |||
** since <code>id</code> is a snowflake ID, and thus chronological, this will be the most recent boost. | |||
There are some problems with this naive implementation though: | |||
* The inner query will select all boosts from all time, every time, because the outer pagination parameters passed to <code>PublicFeed.get</code> don't propagate to the inner query | |||
* The inner query necessarily needs to sort by <code>reblog_of_id</code> first, so just applying a LIMIT will filter on the recency of the ''original post'' rather than the ''boost,'' meaning we will miss most boosts most of the time | |||
So we add the pagination information from <code>PublicFeed.get</code>, mimicking <code>Paginable.to_a_paginated_by_id</code>'s use of the parameters to make a WHERE filter selecting whatever statuses would also be included in the given page that is being fetched | |||
<syntaxhighlight lang="ruby"> | |||
def without_duplicate_reblogs(limit, max_id, since_id, min_id) | |||
inner_query = Status.select('DISTINCT ON (reblog_of_id) statuses.id').reorder(reblog_of_id: :desc, id: :desc) | |||
if min_id.present? | |||
inner_query = inner_query.where(min_id < :id) | |||
elsif since_id.present? | |||
inner_query = inner_query.where(since_id < :id) | |||
end | |||
inner_query = inner_query.where(max_id > :id) if max_id.present? | |||
inner_query = inner_query.limit(limit) if limit.present? | |||
Status.where(statuses: { reblog_of_id: nil }) | |||
.or(Status.where(id: inner_query)) | |||
end | |||
</syntaxhighlight> | |||
But that still doesn't quite get us there, since the first page of a local feed load doesn't have any pagination information in it except for the limit, so we still have the same problem as before. So instead we create a set of <code>n</code> candidate statuses where <code>n</code> is some arbitrary multiple of limit such that we assume that whatever is being filtered out by the other scopes is less than <code>n</code> - ie. we are considering the same set of candidate statuses as the outer scope. This should be relatively fast because we're just taking a slice of the index, rather than applying some complex SQL shit. | |||
Recall that this is still one of several scopes ANDed together, so we will not return any boosts that are filtered by the other scopes, even if they are present in this scope. | |||
<syntaxhighlight lang="ruby"> | |||
def without_duplicate_reblogs(limit, max_id, since_id, min_id) | |||
candidate_statuses = Status.select(:id).reorder(id: :desc) | |||
if min_id.present? | |||
candidate_statuses = candidate_statuses.where(min_id < :id) | |||
elsif since_id.present? | |||
candidate_statuses = candidate_statuses.where(since_id < :id) | |||
end | |||
candidate_statuses = candidate_statuses.where(max_id > :id) if max_id.present? | |||
if limit.present? | |||
limit *= 5 | |||
candidate_statuses = candidate_statuses.limit(limit) | |||
end | |||
inner_query = Status | |||
.where(id: candidate_statuses) | |||
.select('DISTINCT ON (reblog_of_id) statuses.id') | |||
.reorder(reblog_of_id: :desc, id: :desc) | |||
Status.where(statuses: { reblog_of_id: nil }) | |||
.or(Status.where(id: inner_query)) | |||
end | |||
</syntaxhighlight> | |||
But one MORE problem - since we're just considering one page at a time, we will still get duplicated boosts across pages. So... | |||
* When no minimum ID is provided, we use the "multiply the limit" strategy to avoid querying all statuses from all time | |||
* When a minimum ID is provided, we | |||
** Use no limit | |||
** If a maximum ID is also provided, we add 1 day worth of time to the ID so we also don't query arbitrarily into the future when fetching past pages | |||
<syntaxhighlight lang="ruby"> | |||
def without_duplicate_reblogs(limit, max_id, since_id, min_id) | |||
candidate_statuses = Status.select(:id).reorder(id: :desc) | |||
if min_id.present? | |||
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(min_id)) | |||
elsif since_id.present? | |||
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].gt(since_id)) | |||
elsif limit.present? | |||
limit *= 5 | |||
candidate_statuses = candidate_statuses.limit(limit) | |||
end | |||
if max_id.present? | |||
max_time = Mastodon::Snowflake.to_time(id) | |||
max_time += 1.day | |||
max_id = Mastodon::Snowflake.id_at(max_time) | |||
candidate_statuses = candidate_statuses.where(Status.arel_table[:id].lt(max_id)) | |||
end | |||
inner_query = Status | |||
.where(id: candidate_statuses) | |||
.select('DISTINCT ON (reblog_of_id) statuses.id') | |||
.reorder(reblog_of_id: :desc, id: :desc) | |||
Status.where(statuses: { reblog_of_id: nil }) | |||
.or(Status.where(id: inner_query)) | |||
end | |||
</syntaxhighlight> | |||
==== Example Query ==== | |||
<syntaxhighlight lang="sql"> | |||
SELECT "statuses"."id", "statuses"."updated_at" | |||
FROM "statuses" | |||
INNER JOIN "accounts" ON "accounts"."id" = "statuses"."account_id" | |||
WHERE "statuses"."visibility" = $1 | |||
AND "accounts"."suspended_at" IS NULL | |||
AND "accounts"."silenced_at" IS NULL | |||
AND ( | |||
statuses.reply = FALSE | |||
OR statuses.in_reply_to_account_id = statuses.account_id | |||
) | |||
AND ( | |||
"statuses"."reblog_of_id" IS NULL | |||
OR "statuses"."id" IN ( | |||
SELECT DISTINCT ON (reblog_of_id) statuses.id | |||
FROM "statuses" | |||
WHERE "statuses"."deleted_at" IS NULL | |||
AND "statuses"."id" IN ( | |||
SELECT "statuses"."id" FROM "statuses" | |||
WHERE "statuses"."deleted_at" IS NULL | |||
AND "statuses"."id" < 111819125737828654 | |||
ORDER BY "statuses"."id" DESC | |||
LIMIT $2 | |||
) | |||
ORDER BY "statuses"."reblog_of_id" DESC, "statuses"."id" DESC | |||
) | |||
) | |||
AND ( | |||
"statuses"."local" = $3 | |||
OR "statuses"."uri" IS NULL | |||
) | |||
AND "statuses"."deleted_at" IS NULL | |||
AND 1=1 | |||
AND "statuses"."id" < 111813463418866657 | |||
ORDER BY "statuses"."id" DESC LIMIT $4 | |||
[["visibility", 0], ["LIMIT", 100], ["local", true], ["LIMIT", 20]] | |||
</syntaxhighlight> | </syntaxhighlight> | ||