Storage

Revision as of 02:36, 29 August 2024 by Jonny

So Mastodon takes up space, right? How do we manage that?

Currently we just have all our storage on the same Linode compute instance that we run the mastodon instance from. That should probably change at some point to using S3-compatible object storage.

But for now, what the heck takes up all the space?

What Gets Stored

Database

The database is relatively small since it mostly stores text! For reference, as of August 15 (roughly 8 months after the instance started), it is 13.5GB, with statuses making up 4.5GB of that.
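
To see where numbers like these come from, something like the following could be run against the database. This is a rough sketch rather than a record of what we actually ran: the database name mastodon_production is an assumption (it is the Mastodon default), and the helper names db_total_size and db_largest_tables are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report total database size and the largest tables.
# ASSUMPTION: the database is named mastodon_production (the Mastodon default)
# and the current user is allowed to connect to it with psql.

db_total_size() {
  local db="${1:-mastodon_production}"
  # pg_database_size() returns bytes; pg_size_pretty() formats them for humans
  psql -d "$db" -At -c "SELECT pg_size_pretty(pg_database_size(current_database()));"
}

db_largest_tables() {
  local db="${1:-mastodon_production}" limit="${2:-10}"
  # pg_total_relation_size() includes indexes and TOAST data for each table
  psql -d "$db" -At -F ' ' -c "
    SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT ${limit};"
}

# Usage:
#   db_total_size
#   db_largest_tables mastodon_production 5
```

The second helper is the one that would show how much of the total the statuses table accounts for.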

Media Storage

The instance is currently configured to remove media (images, video, audio) from other instances from the cache after 1 day - after that, the media is redownloaded from the hosting instance when accessed.

Of course, we host all media from our members indefinitely, and actively encourage them to post big images and videos because that fills up their stat counters.

Cache

Data from other instances, apart from statuses and other text, is stored in the system cache: mastodon/live/public/system/cache. This currently makes up most of our storage needs.

It includes:

  • account data, including avatars and header images, which are stored indefinitely and thus need to be pruned
  • media from other instances (images, videos in posts)
  • custom emojis from other instances
  • link preview cards for embedded links
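
To check which of these categories is actually eating the disk, a quick du over the cache directory is usually enough. A minimal sketch, assuming the cache lives under /home/mastodon/live as elsewhere on this page; cache_usage is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list the subdirectories of a cache directory, largest first.
# ASSUMPTION: Mastodon is installed in /home/mastodon/live.

cache_usage() {
  local cache_dir="${1:-/home/mastodon/live/public/system/cache}"
  # -s: one total per subdirectory, -k: sizes in KiB so sort -n can order them
  du -sk "$cache_dir"/*/ | sort -rn
}

# Usage:
#   cache_usage
#   cache_usage /home/mastodon/live/public/system/cache | head -n 5
```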

Object Storage

Our media is configured to be served via a reverse proxy at https://media.neuromatch.social

Proxy

First we set up the reverse proxy as described in the Linode and Mastodon docs. We modified the Nginx configuration slightly: it uses a different cache than the regular Mastodon cache, and proxy_cache_path is declared outside the server block, because Nginx does not allow it inside a server block the way the Linode docs have it.

The linode docs are also incorrect in their spec of the proxy_set_header Host directive. So our full config is below:

proxy_cache_path /var/cache/nginx-object-storage keys_zone=CACHEOBJECT:10m inactive=7d max_size=10g;

server {
  listen 443 ssl http2;
  listen [::]:443 ssl http2;
  server_name media.neuromatch.social;
  root /var/www/html;
  ssl_certificate /etc/letsencrypt/live/media.neuromatch.social/fullchain.pem; # managed by Certbot
  ssl_certificate_key /etc/letsencrypt/live/media.neuromatch.social/privkey.pem; # managed by Certbot

  keepalive_timeout 30;

  location = / {
    index index.html;
  }

  location / {
    try_files $uri @s3;
  }

  set $s3_backend 'https://neuromatchstodon.us-east-1.linodeobjects.com';

  location @s3 {
    limit_except GET {
      deny all;
    }

    resolver 8.8.8.8;
    proxy_set_header Host neuromatchstodon.us-east-1.linodeobjects.com;
    proxy_set_header Connection '';
    proxy_set_header Authorization '';
    proxy_hide_header Set-Cookie;
    proxy_hide_header 'Access-Control-Allow-Origin';
    proxy_hide_header 'Access-Control-Allow-Methods';
    proxy_hide_header 'Access-Control-Allow-Headers';
    proxy_hide_header x-amz-id-2;
    proxy_hide_header x-amz-request-id;
    proxy_hide_header x-amz-meta-server-side-encryption;
    proxy_hide_header x-amz-server-side-encryption;
    proxy_hide_header x-amz-bucket-region;
    proxy_hide_header x-amzn-requestid;
    proxy_ignore_headers Set-Cookie;
    proxy_pass $s3_backend$uri;
    proxy_intercept_errors off;

    proxy_cache CACHEOBJECT;
    proxy_cache_valid 200 48h;
    proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    proxy_cache_lock on;

    expires 1y;
    add_header Cache-Control public;
    add_header 'Access-Control-Allow-Origin' '*';
    add_header X-Cache-Status $upstream_cache_status;
    add_header X-Content-Type-Options nosniff;
    add_header Content-Security-Policy "default-src 'none'; form-action 'none'";
  }

}
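
Since the config adds an X-Cache-Status header, one way to sanity-check the proxy is to request the same object twice and look at that header. The snippet below is only a sketch: cache_status is a made-up helper, and MEDIA_PATH stands in for the path of any real object in the bucket.

```shell
#!/usr/bin/env bash
# Sketch: see whether a response was served from the Nginx object-storage cache.

# Read HTTP response headers on stdin and print the X-Cache-Status value, if any.
cache_status() {
  tr -d '\r' | awk -F': ' 'tolower($1) == "x-cache-status" { print $2 }'
}

# Usage (MEDIA_PATH is a placeholder, not a real path):
#   sudo nginx -t && sudo systemctl reload nginx
#   curl -s -o /dev/null -D - "https://media.neuromatch.social/${MEDIA_PATH}" | cache_status
# The first request for an object would normally report MISS, and a repeat
# request within the cache lifetime HIT.
```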

Configuration

As with Nginx, Linode's docs are slightly wrong. Our config looks like:

S3_ENABLED=true
S3_PROTOCOL=https
S3_BUCKET=neuromatchstodon
S3_REGION=us-east-1
S3_HOSTNAME=neuromatchstodon.us-east-1.linodeobjects.com
S3_ENDPOINT=https://us-east-1.linodeobjects.com/
AWS_ACCESS_KEY_ID={{ OUR KEY ID }}
AWS_SECRET_ACCESS_KEY={{ OUR SECRET KEY }}
S3_ALIAS_HOST=media.neuromatch.social

Until you get your existing files transferred to the bucket, keep S3_ENABLED=false.
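
Before flipping that switch, it can be worth confirming that none of the settings above were left out or left empty. A small sketch of such a check; check_s3_env is a hypothetical helper and the key list is simply the one from our config above.

```shell
#!/usr/bin/env bash
# Sketch: verify that the S3-related keys are present and non-empty in an env file.

check_s3_env() {
  local env_file="$1" missing=0 key
  for key in S3_PROTOCOL S3_BUCKET S3_REGION S3_HOSTNAME S3_ENDPOINT \
             AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY S3_ALIAS_HOST; do
    # A key counts as set only if something follows the "="
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      echo "missing or empty: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_s3_env /home/mastodon/live/.env.production && echo "S3 settings look complete"
```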

Transition

To transition data from our existing instance, we used rclone.

docs:

Install:

curl -O https://downloads.rclone.org/rclone-current-linux-amd64.zip
unzip rclone-current-linux-amd64.zip
cd rclone-*-linux-amd64

# Copy binary file

sudo cp rclone /usr/bin/
sudo chown root:root /usr/bin/rclone
sudo chmod 755 /usr/bin/rclone

# Install manpage

sudo mkdir -p /usr/local/share/man/man1
sudo cp rclone.1 /usr/local/share/man/man1/
sudo mandb

Then configure with rclone config

  • (n)ew remote
  • name: neuromatchstodon
  • keys happen
  • Select s3, configure for current bucket (see linode dashboard)


HUGE WASTE OF TIME WARNING: the default ACL for rclone is private, but Mastodon requires objects to be public. You have to set the ACL in your rclone config file (rclone.conf), under the section named after your remote, like

[neuromatchstodon]
acl = public-read

before you copy/sync. If you do manage to copy all your files in private mode, you can use s3cmd afterwards to change all their ACLs like

s3cmd setacl s3://bucketname --acl-public --recursive
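
A quick way to tell whether the ACL problem applies to you is to fetch one object anonymously, straight from the bucket. This is a sketch under assumptions: http_status and acl_verdict are made-up helpers, OBJECT_KEY is a placeholder for the key of any uploaded file, and the 403-means-private reading is the usual behaviour of S3-compatible storage rather than something we verified against Linode specifically.

```shell
#!/usr/bin/env bash
# Sketch: check whether an object in the bucket is publicly readable.

# Print the HTTP status code for an anonymous GET of a URL.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Translate a status code into a rough verdict about the object's ACL.
acl_verdict() {
  case "$1" in
    200) echo "public" ;;
    403) echo "private (ACL probably not public-read)" ;;
    *)   echo "unclear (HTTP $1)" ;;
  esac
}

# Usage (OBJECT_KEY is a placeholder, not a real key):
#   acl_verdict "$(http_status "https://neuromatchstodon.us-east-1.linodeobjects.com/${OBJECT_KEY}")"
```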

Now copy the data to the bucket!

cd /home/mastodon/live
# rclone copy -P public/system {rclone_label}:{bucket_name}
rclone copy -P public/system neuromatchstodon:neuromatchstodon
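
Once the copy finishes, rclone can also compare the local files against the bucket, which is a cheap way to catch a transfer that silently stopped partway. A minimal sketch, not necessarily what we ran at the time; verify_copy is a hypothetical wrapper and its defaults are the source path and remote used above.

```shell
#!/usr/bin/env bash
# Sketch: confirm that every local file made it to the bucket.

verify_copy() {
  local src="${1:-public/system}" dest="${2:-neuromatchstodon:neuromatchstodon}"
  # --one-way: only verify that every source file exists on the destination;
  # extra files already in the bucket are not reported as errors.
  rclone check --one-way "$src" "$dest"
}

# Usage:
#   cd /home/mastodon/live
#   rclone size neuromatchstodon:neuromatchstodon   # object count and total bytes in the bucket
#   verify_copy
```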

Enabling

After all your objects are transferred, you should

  • change S3_ENABLED=true in your .env.production
  • restart mastodon services like sudo systemctl restart mastodon-*

Make a test post to confirm things are working, and you should be good to go <3

Managing Storage

Media

Media storage is configured from the Mastodon admin interface - https://neuromatch.social/admin/settings/content_retention

Warning: Do not set the "Content cache retention period", as it will remove posts and boosts from other servers! That includes removing the bookmarks, favorites, and boosts of our members! Bad to do!

Cache Pruning

See Maintenance#Cache Pruning

We use tootctl to periodically prune data from the server by running these commands every month. At the moment this keeps our storage in a sustainable range:

# remove cover images from accounts that nobody on the instance is following
tootctl media remove --remove-headers
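
For running this on a schedule, a wrapper along these lines could be called from cron or a systemd timer. It is a sketch, not our actual job: prune_cache is a hypothetical name, and it assumes Mastodon lives in /home/mastodon/live and that the calling user already has the Ruby environment tootctl needs.

```shell
#!/usr/bin/env bash
# Sketch: run the monthly header prune from the Mastodon directory.
# ASSUMPTION: Mastodon is installed in /home/mastodon/live and this runs as
# a user whose environment can execute bin/tootctl.

prune_cache() {
  local mastodon_dir="${1:-/home/mastodon/live}"
  if [ "$#" -gt 0 ]; then shift; fi
  # Run in a subshell so the caller's working directory is unchanged.
  (
    cd "$mastodon_dir" || exit 1
    # Any remaining arguments are passed through to tootctl.
    RAILS_ENV=production bin/tootctl media remove --remove-headers "$@"
  )
}

# Usage:
#   prune_cache /home/mastodon/live --dry-run   # report what would be removed
#   prune_cache
```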

We might also want to add these commands that make sense to do, but don't necessarily contribute a ton to our storage burden. Want to check with the rest of Tech WG first before I do these (-Jonny 23-08-15)

# remove remote accounts that no longer exist
# excludes accounts with confirmed activity in the last week in case the server is down
tootctl accounts cull

# remove files that do not belong to media attachments.
tootctl media remove-orphans

Logs

Logs end up being a surprisingly large contributor to storage burden!

Systemd

Most of the large logs are managed by systemd/journald. To prevent these from growing out of control, we set a global maximum disk usage of 5G, and then set a relatively small cap for each individual file so that they are rotated/archived frequently. We additionally delete all log files older than one month, if they manage to escape these caps.

In /etc/systemd/journald.conf

[Journal]
Storage=auto
Compress=yes
SystemMaxUse=5G
SystemMaxFileSize=64M
SystemMaxFiles=10
MaxRetentionSec=1month

After editing this configuration, restart journald to apply it:

sudo systemctl restart systemd-journald
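
To confirm the caps are doing their job, or to trim logs by hand without waiting for rotation, journalctl has matching options. A minimal sketch; journal_report and journal_trim are made-up helper names, and the defaults simply mirror the limits in the config above.

```shell
#!/usr/bin/env bash
# Sketch: inspect and manually trim journald's disk usage.

journal_report() {
  # Total disk space used by active and archived journal files.
  journalctl --disk-usage
}

journal_trim() {
  local max_size="${1:-5G}" max_age="${2:-1month}"
  # Delete archived journal files until both limits are satisfied.
  # Needs root, so call this from a root shell.
  journalctl --vacuum-size="$max_size" --vacuum-time="$max_age"
}

# Usage:
#   journal_report
#   journal_trim          # same limits as journald.conf above
#   journal_trim 1G 2weeks
```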

Nginx

See Wiki/Nginx#Log_Rotation

Reference