So, Mastodon takes up space, right? How do we manage that?
Currently we keep all our storage on the same Linode compute instance that runs the Mastodon instance itself. That should probably change at some point to S3-compatible object storage (the Object Storage section below describes that migration).
But for now, what the heck takes up all the space?
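To get a quick answer at any moment, a couple of one-liners go a long way (a sketch - the paths assume the standard /home/mastodon/live layout used later on this page, and mastodon_production is Mastodon's default database name):

# per-directory disk usage of mastodon's local storage
sudo du -sh /home/mastodon/live/public/system/*
# total size of the postgres database
sudo -u postgres psql -c "SELECT pg_size_pretty(pg_database_size('mastodon_production'));"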
What Gets Stored
Database
The database is relatively small since it mostly stores text! For reference, as of August 15 (roughly 8 months after the instance started), it was 13.5GB, with statuses making up 4.5GB of that.
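If you're curious which tables dominate, a query like this (against mastodon_production; a sketch, not something we run routinely) breaks the size down per table:

sudo -u postgres psql -d mastodon_production -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"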
Media Storage
The instance is currently configured to remove media (images, video, audio) from other instances from the cache after 1 day; after that, the media is re-downloaded from the hosting instance when it's next accessed.
Of course, we host all media from our members indefinitely, and actively encourage them to post big images and videos, because that fills up their stat counters.
Cache
Data from other instances that isn't statuses and other text data is stored in the filesystem cache at mastodon/live/public/system/cache - this currently makes up most of our storage needs. It includes:
- account data, including avatars and header images, which are stored indefinitely and thus need to be pruned
- media from other instances (images, videos in posts)
- custom emojis from other instances
- link preview cards for embedded links
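To see how each of these contributes on a given day, you can du the cache subdirectories (a quick sketch - subdirectory names like media_attachments and preview_cards are the stock Mastodon layout, so check against your own tree):

sudo du -sh /home/mastodon/live/public/system/cache/*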
Object Storage
- Masto docs: https://docs.joinmastodon.org/admin/optional/object-storage/
- Linode docs: Linode's guide to setting up object storage for Mastodon
Our media is configured to be served via a reverse proxy at https://media.neuromatch.social
Proxy
First we set up the reverse proxy as described in the Linode and Masto docs above, with two modifications: we use a separate Nginx cache from the regular Masto cache, and we declare the proxy_cache_path outside the server block, since (contrary to the Linode docs) it can't be declared inside one. The Linode docs are also incorrect in their spec of the proxy_set_header Host directive (the working value is the bucket's hostname, as in the config below). So our full config is:
# the cache path must be declared outside the server block
proxy_cache_path /var/cache/nginx-object-storage keys_zone=CACHEOBJECT:10m inactive=7d max_size=10g;

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name media.neuromatch.social;
    root /var/www/html;

    ssl_certificate /etc/letsencrypt/live/media.neuromatch.social/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/media.neuromatch.social/privkey.pem; # managed by Certbot

    keepalive_timeout 30;

    location = / {
        index index.html;
    }

    location / {
        try_files $uri @s3;
    }

    # putting the backend in a variable makes nginx resolve it at request time
    # (hence the resolver directive below)
    set $s3_backend 'https://neuromatchstodon.us-east-1.linodeobjects.com';

    location @s3 {
        # media is read-only through the proxy
        limit_except GET {
            deny all;
        }

        resolver 8.8.8.8;

        # the Host header must be the bucket's hostname, not media.neuromatch.social
        proxy_set_header Host neuromatchstodon.us-east-1.linodeobjects.com;
        proxy_set_header Connection '';
        proxy_set_header Authorization '';

        # strip bucket headers that clients don't need to see
        proxy_hide_header Set-Cookie;
        proxy_hide_header 'Access-Control-Allow-Origin';
        proxy_hide_header 'Access-Control-Allow-Methods';
        proxy_hide_header 'Access-Control-Allow-Headers';
        proxy_hide_header x-amz-id-2;
        proxy_hide_header x-amz-request-id;
        proxy_hide_header x-amz-meta-server-side-encryption;
        proxy_hide_header x-amz-server-side-encryption;
        proxy_hide_header x-amz-bucket-region;
        proxy_hide_header x-amzn-requestid;
        proxy_ignore_headers Set-Cookie;

        proxy_pass $s3_backend$uri;
        proxy_intercept_errors off;

        # cache successful responses for 48h; serve stale copies on upstream errors
        proxy_cache CACHEOBJECT;
        proxy_cache_valid 200 48h;
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
        proxy_cache_lock on;

        expires 1y;
        add_header Cache-Control public;
        add_header 'Access-Control-Allow-Origin' '*';
        add_header X-Cache-Status $upstream_cache_status;
        add_header X-Content-Type-Options nosniff;
        add_header Content-Security-Policy "default-src 'none'; form-action 'none'";
    }
}
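Once the proxy is up, you can sanity-check it and its cache via the X-Cache-Status header added above (the media path below is a hypothetical placeholder - grab a real URL from any post):

# the first request should report MISS, an immediate repeat should report HIT
curl -sI https://media.neuromatch.social/media_attachments/files/example.png | grep -i x-cache-status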
Configuration
As with Nginx, Linode's docs are slightly wrong; our .env.production config looks like:
S3_ENABLED=true
S3_PROTOCOL=https
S3_BUCKET=neuromatchstodon
S3_REGION=us-east-1
S3_HOSTNAME=neuromatchstodon.us-east-1.linodeobjects.com
S3_ENDPOINT=https://us-east-1.linodeobjects.com/
AWS_ACCESS_KEY_ID={{ OUR KEY ID }}
AWS_SECRET_ACCESS_KEY={{ OUR SECRET KEY }}
S3_ALIAS_HOST=media.neuromatch.social
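(S3_ALIAS_HOST is the piece that makes Mastodon generate media URLs pointing at the reverse proxy rather than directly at the bucket.)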
Until you get your existing files transferred to the bucket, keep S3_ENABLED=false.
Transition
To transition data from our existing instance, we used rclone.
docs:
- https://www.linode.com/docs/guides/rclone-object-storage-file-sync/
- install rclone from precompiled binary: https://rclone.org/install/#linux-precompiled
Install:
curl -O https://downloads.rclone.org/rclone-current-linux-amd64.zip
unzip rclone-current-linux-amd64.zip
cd rclone-*-linux-amd64
# Copy binary file
sudo cp rclone /usr/bin/
sudo chown root:root /usr/bin/rclone
sudo chmod 755 /usr/bin/rclone
# Install manpage
sudo mkdir -p /usr/local/share/man/man1
sudo cp rclone.1 /usr/local/share/man/man1/
sudo mandb
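Then confirm the binary is on your PATH:

rclone version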
Then configure with rclone config:
- (n)ew remote
- name: neuromatchstodon
- select s3 as the storage type
- enter the bucket's access key and secret key (see the linode dashboard)
- configure the endpoint/region to match the current bucket (also in the linode dashboard)
HUGE WASTE OF TIME WARNING: the default ACL for rclone is private, but Mastodon requires objects to be public. You have to set the ACL in the remote's section of rclone.conf (the section name matches the remote name you chose above) before you copy/sync:

[neuromatchstodon]
acl = public-read

If you do manage to copy all your files in private mode, you can use s3cmd afterwards to change all their ACLs like:

s3cmd setacl s3://bucketname --acl-public --recursive
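For reference, the remote's section in ~/.config/rclone/rclone.conf ends up looking something like this (a sketch: the provider value is our best guess for Linode's Ceph-backed storage - newer rclone versions also offer a dedicated Linode provider - and the keys are placeholders):

[neuromatchstodon]
type = s3
provider = Ceph
access_key_id = {{ OUR KEY ID }}
secret_access_key = {{ OUR SECRET KEY }}
endpoint = us-east-1.linodeobjects.com
acl = public-read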
Now copy the data to the bucket!
cd /home/mastodon/live
# rclone copy -P public/system {rclone_label}:{bucket_name}
rclone copy -P public/system neuromatchstodon:neuromatchstodon
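Before moving on, it's worth verifying the transfer - rclone check compares file sizes and hashes between the source and the bucket:

rclone check public/system neuromatchstodon:neuromatchstodon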
Enabling
After all your objects are transferred, you should:
- change S3_ENABLED=true in your .env.production
- restart mastodon services like sudo systemctl restart mastodon-*
Make a test post to confirm things are working, and you should be good to go <3
Managing Storage
Media
Media storage is configured from the Mastodon admin interface - https://neuromatch.social/admin/settings/content_retention
Warning: Do not set the "Content cache retention period", as it will remove posts and boosts from other servers! That includes removing the bookmarks, favorites, and boosts of our members! Bad to do!
Cache Pruning
We use tootctl to periodically prune data from the server by running the following every month. At the moment this keeps our storage in a sustainable range:
# remove header images from accounts that nobody on the instance is following
tootctl media remove --remove-headers
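We run this by hand at the moment; if we wanted to automate it, a crontab entry along these lines would do it (a sketch, assuming the mastodon user's crontab, the standard live checkout, and that cron's PATH can find ruby - not currently deployed):

# prune remote headers at 04:00 on the 1st of each month
0 4 1 * * cd /home/mastodon/live && RAILS_ENV=production bin/tootctl media remove --remove-headers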
We might also want to add these commands, which make sense to do but don't necessarily contribute a ton to our storage burden. Want to check with the rest of Tech WG first before I do these (-Jonny 23-08-15)
# remove remote accounts that no longer exist
# excludes accounts with confirmed activity in the last week in case the server is down
tootctl accounts cull
# remove files that do not belong to media attachments.
tootctl media remove-orphans
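Both of these should also accept a --dry-run flag to preview what they'd delete - worth confirming against tootctl's built-in help on our version before running for real.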
Logs
Logs end up being a surprisingly large contributor to storage burden!
Systemd
Most of the large logs are managed by systemd/journald. To prevent these from growing out of control, we set a global maximum disk usage of 5G, and then set a relatively small cap for each individual file so that they are rotated/archived frequently. We additionally delete all log files older than one month, if they manage to escape these caps.
In /etc/systemd/journald.conf:

[Journal]
Storage=auto
Compress=yes
SystemMaxUse=5G
SystemMaxFileSize=64M
SystemMaxFiles=10
MaxRetentionSec=1month
After editing this configuration, reload it:
sudo systemctl restart systemd-journald
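You can check what the journal is actually using at any time:

journalctl --disk-usage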
Nginx
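Nginx's access and error logs live in /var/log/nginx and are rotated by logrotate (on Debian/Ubuntu, the stock config ships at /etc/logrotate.d/nginx). If these start eating space, the rotation can be tightened - a sketch with assumed retention values, not necessarily what we currently run:

/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        invoke-rc.d nginx rotate >/dev/null 2>&1
    endscript
}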
Reference
- Helpful info on managing disk usage with tootctl - https://ricard.dev/improving-mastodons-disk-usage/