Building Your Own Mirror Site with tunasync

Also posted to my bilibili column.

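Earlier this month I picked up a dual-socket E5 v2 machine from Houge. The people in my chat group consider it obsolete in terms of performance, but it is the most powerful machine I own. I installed Arch Linux on it as soon as it arrived.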

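Arch Linux is a rolling-release distribution, so installing a new package during maintenance implies a pacman -Sy. Strictly speaking it should be a -Syuu, because partial upgrades are not recommended, and after a full upgrade a reboot is advisable.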

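I reach the server remotely over ssh, so rebooting it is a chore. I am confident that I configured the server correctly, but I still worry that an unforeseen package change could leave me locked out.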

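A very reasonable approach is to take a snapshot of the current package mirror. An even more reasonable one is to simply run a local mirror site. I asked a colleague, and he gave the same advice.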

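I also remembered that back in university, a neighbouring interest group tried to set up a campus mirror with tunasync and failed. That piqued my curiosity: this tool was open-sourced by the greatest and most advanced university mirror site in China, so what advantages does it actually have, and how complex is it really?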

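The conclusion first: although tunasync has quite a few advanced scheduling features, a first-time user can still get a working mirror site and web page running within a few hours. Server monitoring and detailed configuration of the web page take more time.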

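The tunasync setup described in this article certainly contains mistakes and room for improvement. The scripts work but are certainly not optimal. Corrections are welcome.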

System environment

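Although the server runs Arch Linux, tunasync lives in an unprivileged LXC container running Debian bookworm, which makes the service easier to isolate and maintain. Since LXC can already limit a container's resources through cgroup v2, I did not use tunasync's own cgroup features.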

OS: Arch Linux x86_64
Kernel: 6.5.2-arch1-1
LXC: 5.0.3
LXC Distro: Debian GNU/Linux 12 (bookworm)
cgroup2.memory.max: 10G
cgroup2.memory.high: 2G

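As shown, the container is squeezed to roughly 2G of RAM (the memory.high threshold), so in theory the whole service is not demanding on resources.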
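
To check which limits actually apply from inside the container, the cgroup v2 interface files can be read directly. The following is a minimal Python sketch, not part of the original setup. It assumes the container sees its own cgroup at /sys/fs/cgroup; that path and the helper names are my assumptions.

```python
from pathlib import Path
from typing import Optional

# Assumed mount point of the container's own cgroup v2 hierarchy.
CGROUP_ROOT = Path("/sys/fs/cgroup")


def parse_limit(raw: str) -> Optional[int]:
    """Parse a cgroup v2 memory limit: 'max' means unlimited, else bytes."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def read_limit(name: str, root: Path = CGROUP_ROOT) -> Optional[int]:
    """Read e.g. memory.max or memory.high; None if unlimited or missing."""
    path = root / name
    if not path.exists():
        return None
    return parse_limit(path.read_text())


def describe(limit: Optional[int]) -> str:
    """Render a limit in GiB for display."""
    return "unlimited" if limit is None else f"{limit / 2**30:.1f} GiB"


if __name__ == "__main__":
    for name in ("memory.high", "memory.max"):
        print(name, describe(read_limit(name)))
```

If the files are not visible in your container, the same values can be checked from the host side in the container's cgroup directory.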

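For LXC configuration and the creation of unprivileged containers, see the relevant ArchWiki page; its translation was still in progress when I wrote this article.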

All configuration below is done inside the Debian system in the LXC container.

Building tunasync

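Although tunasync's GitHub releases provide prebuilt packages, building the binaries yourself still has some value. Note that the default package mirror of the LXC Debian container is not well suited to use in mainland China; change it to your liking.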

sudo apt-get install golang git build-essential

export https_proxy=http://xxxx.xxx:xxxx

git clone https://github.com/tuna/tunasync.git

cd tunasync/
make

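The build-linux-amd64 directory now contains the two freshly built binaries, tunasync and tunasynctl. tunasync is the core that performs mirror synchronization, while tunasynctl, as its name suggests, is a control tool.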

After verifying that the binaries work, install them:

cd build-linux-amd64/
sudo cp tunasync* /usr/local/bin/

Configuring tunasync

tunasync can run in two modes, worker and manager. A worker performs the actual mirror synchronization, while the manager is the interface that tunasynctl talks to.

See the tunasync documentation for detailed explanations. Here are a few reference configurations.

cat /etc/tunasync/ctl.conf

manager_addr = "127.0.0.1"
manager_port = 21020
ca_cert = ""

cat /etc/tunasync/manager.conf

debug = false

[server]
addr = "127.0.0.1"
port = 21020
ssl_cert = ""
ssl_key = ""

[files]
db_type = "bolt"
db_file = "/path/to/your/tunasync_log/db/manager.db"
ca_cert = ""

cat /etc/tunasync/worker.conf

[global]
name = "mirror_worker"
log_dir = "/path/to/your/tunasync_log/{{.Name}}"
mirror_dir = "/path/to/your/mirror/tunasync"
concurrent = 2
interval = 360

[manager]
api_base = "http://localhost:21020"
token = ""
ca_cert = ""

[cgroup]
enable = false
base_path = "/sys/fs/cgroup"
group = "tunasync"

[server]
hostname = "localhost"
listen_addr = "127.0.0.1"
listen_port = 21021
ssl_cert = ""
ssl_key = ""

[[mirrors]]
name = "archlinux"
provider = "rsync"
upstream = "rsync://mirrors.bfsu.edu.cn/archlinux/"

[[mirrors]]
name = "archlinuxcn"
provider = "rsync"
upstream = "rsync://mirrors.bfsu.edu.cn/archlinuxcn/"

Syncing with rsync requires the rsync package:

sudo apt-get install rsync

The tunasync repository ships two systemd units. Since we are not using tunasync's built-in cgroup feature, they need small changes:

cat /usr/lib/systemd/system/tunasync-worker.service

[Unit]
Description = TUNA mirrors sync worker
After=network.target

[Service]
Type=simple
User=tunasync
PermissionsStartOnly=true
#ExecStartPre=/usr/bin/cgcreate -t tunasync -a tunasync -g memory:tunasync
ExecStart=/usr/local/bin/tunasync worker -c /etc/tunasync/worker.conf --with-systemd
ExecReload=/bin/kill -SIGHUP $MAINPID
#ExecStopPost=/usr/bin/cgdelete memory:tunasync

[Install]
WantedBy=multi-user.target

cat /usr/lib/systemd/system/tunasync-manager.service

[Unit]
Description = TUNA mirrors sync manager
After=network.target
Requires=network.target

[Service]
Type=simple
User=tunasync
ExecStart = /usr/local/bin/tunasync manager -c /etc/tunasync/manager.conf --with-systemd

[Install]
WantedBy=multi-user.target

Enable the two systemd services. If all goes well, synchronization should start right away.

sudo systemctl enable --now tunasync-manager
sudo systemctl enable --now tunasync-worker
tunasynctl list --all
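
Besides tunasynctl, the manager can also be queried over HTTP. The Python sketch below is an illustration under assumptions: the address comes from manager.conf above, the /jobs path is the one the nginx configuration in this article proxies to, and the "name" and "status" keys as well as the "success" status value are my assumptions about the JSON the manager returns. Check them against the real output before relying on this.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Address from manager.conf above; /jobs is the path proxied by nginx
# later in this article.
JOBS_URL = "http://127.0.0.1:21020/jobs"


def fetch_jobs(url: str = JOBS_URL) -> list[dict]:
    """Download the job list from the manager (needs a running manager)."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)


def summarize(jobs: list[dict]) -> dict[str, int]:
    """Count jobs per status; 'status' is an assumed field name."""
    return dict(Counter(job.get("status", "unknown") for job in jobs))


def not_successful(jobs: list[dict]) -> list[str]:
    """Names of jobs whose status is not 'success' (assumed value)."""
    return [
        job.get("name", "<unnamed>")
        for job in jobs
        if job.get("status") != "success"
    ]


if __name__ == "__main__":
    jobs = fetch_jobs()
    print(summarize(jobs))
    print("needs attention:", not_successful(jobs))
```

Right after the first start, it is normal for mirrors to show up as not yet successful, since the initial sync of a repository such as archlinux takes a while.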

Building tunasync web

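The mirror site's web pages are open source as well. In theory we could build a site identical to the Tsinghua mirror, but in practice please follow the rules stated in its README.md.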

git clone https://github.com/tuna/mirror-web.git

sudo apt-get install ruby ruby-dev ruby-bundler

curl -fsSL "https://cdn.jsdelivr.net/gh/tj/n@7.3.0/bin/n#v14.17.1" | sudo  bash -s -- lts

cd mirror-web

bundle install
sudo sed -i 's/@context ||= ExecJS.compile("var self = this; " + File.read(script_path))/@context ||= ExecJS.compile("var self = this; " + File.read(script_path, :encoding => "UTF-8"))/' /var/lib/gems/3.1.0/gems/babel-transpiler-0.7.0/lib/babel/transpiler.rb

export LANG=en_US.UTF-8
jekyll build

The title and similar content can be changed in _config.yml. The generated pages are placed in the _site directory.

Configuring tunasync web

To keep things simple, I copied the generated pages straight into the root directory of the mirror and serve them with nginx.

sudo apt-get install python3 nginx libnginx-mod-http-fancyindex

cd _site
cp -r * /path/to/your/mirror/tunasync
mkdir -p /path/to/your/mirror/tunasync/static/status

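For the page to work properly, we also need to generate tunasync.json, isoinfo.json and disk.json, which are dynamic files describing the mirror's status. isoinfo.json is produced by the geninfo/genisolist.py script, so we only need a script that runs it periodically.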

Note that genisolist.py and genisolist.ini must be in the same directory.

cat /etc/tunasync/genisolist/genisolist.ini
# This file is the config required by genisolist.py

# This special section named "%main%" defined following variables:
# "root": HTTP root of mirrors. The script will locate the images in it.
# "urlbase": URL of mirrors prepended to image path. We want to use relative
#            path, so set it to '/' instead of complete URL.
# "d[N]": For distribution sorting, where N is a positive integer. The value
#         is the distro name specified in the sections below. A lower N makes
#         the distro show higher. Default N is 0xFFFF for distros not mentioned.
[%main%]
root = /path/to/your/mirror/tunasync/
urlbase = /
d10 = Arch Linux

# Sections whose name isn't "%main%" defined a detect rule of image detection.
[archlinux]
# Section name is of no use, the display name is specified in "distro" option.
distro = Arch Linux
# listvers defined how many latest versions to display.
listvers = 1
# "location" specifies globbing pathname of the image. The path is relative to
# the HTTP root (aka "root" in [%main%] section). Not all images match it is
# considered, you can use "pattern" option below to filter.
location = archlinux/iso/latest/archlinux-*.iso
# "pattern" is a regular expression. If the pattern is found in image path
# found by "location", then the image is valid. Group capturing is to extract
# image info from image path name.
pattern = archlinux-(\d+\.\d+\.\d+)-(\w+).iso
# Following 3 options describes image info. "type" and "platform" is optional.
# $1, $2... here will be replaced by the string captured in above "pattern".
# Additionally, $0 will be replaced by the whole string matches the pattern.
# "version" is also used as the key to sort images of the same distro.
version = $1
type = CLI-only
platform = $2
# "key_by" (a.k.a group by) should be used when images of different types or platform have
# different version number, see lineageOS below.
# "nosort" should be used when sort is not possible (i.e. no version number),
# in which case listvers should not be set
# nosort would be effective when "nosort" presents, regardless its value
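
To see what the "pattern", "version" and "platform" options in the [archlinux] section do, the matching can be tried out in isolation. The Python sketch below only mimics the rule described in the comments above ($N replaced by captured groups); it is not genisolist.py's actual code, and the helper names and the sample file name are made up for illustration.

```python
import re
from typing import Optional

# "pattern" from the [archlinux] section above, copied unchanged.
PATTERN = r"archlinux-(\d+\.\d+\.\d+)-(\w+).iso"


def expand(template: str, match: re.Match) -> str:
    """Replace $0, $1, $2... in template with the groups of match."""
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), template)


def describe_image(path: str) -> Optional[dict]:
    """Apply the [archlinux] rule to one path; None if it does not match."""
    match = re.search(PATTERN, path)
    if match is None:
        return None
    return {
        "version": expand("$1", match),
        "type": expand("CLI-only", match),
        "platform": expand("$2", match),
    }


if __name__ == "__main__":
    # Hypothetical file name following the usual Arch ISO naming scheme.
    print(describe_image("archlinux/iso/latest/archlinux-2023.09.01-x86_64.iso"))
    print(describe_image("archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.gz"))
```

Note that the dot before "iso" in the pattern is unescaped and therefore matches any character. This is harmless here because "location" already restricts the candidates to *.iso files.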

I wrote a script that generates isoinfo.json and disk.json. Adjust the disks variable to match your own setup.

cat /etc/tunasync/genisolist/genisoinfo.sh

#!/bin/bash

web=/path/to/your/mirror/tunasync
disks="/ /mnt/mirror"

while true; do
        python3 /etc/tunasync/genisolist/genisolist.py 2>/dev/null > $web/static/status/isoinfo.json
        echo -n "[" > $web/static/status/disk.json
        df -B 1k --output="size,used" $disks  | awk '{if (FNR==1) ; else {if (FNR>2) printf ","; printf "{\"total_kb\":%s,\"used_kb\":%s}", $1, $2;}}' >> $web/static/status/disk.json
        echo -n "]" >> $web/static/status/disk.json
        sleep 15m
done
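
For comparison, the disk.json part of the script can also be written in Python, which avoids hand-assembling JSON with echo and awk. This is a sketch of an alternative, not what the setup above runs; the function names are my own, and the output shape (a list of objects with total_kb and used_kb) simply mirrors what the shell script prints.

```python
import json
import os
import shutil


def disk_status(mount_points: list[str]) -> list[dict]:
    """Return total and used space in KiB for each mount point."""
    status = []
    for mount in mount_points:
        usage = shutil.disk_usage(mount)
        status.append({
            "total_kb": usage.total // 1024,
            "used_kb": usage.used // 1024,
        })
    return status


def write_disk_json(path: str, mount_points: list[str]) -> None:
    """Write disk.json via a temporary file so readers never see half a file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(disk_status(mount_points), fh)
    os.replace(tmp_path, path)


if __name__ == "__main__":
    print(json.dumps(disk_status(["/"])))
```

Writing to a temporary file and renaming it means nginx cannot serve a partially written disk.json, which can happen briefly with the three separate writes in the shell version.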

This script is also managed as a systemd service:

cat /usr/lib/systemd/system/tunasync-genisoinfo.service

[Unit]
Description = TUNA mirrors genisoinfo
After=network.target

[Service]
Type=simple
User=tunasync
ExecStart=/etc/tunasync/genisolist/genisoinfo.sh

[Install]
WantedBy=multi-user.target

tunasync.json, on the other hand, is simply proxied by nginx to the tunasync manager's port. Here is an nginx configuration:

cat /etc/nginx/sites-enabled/default

map $http_user_agent $isbrowser {
        default 0;
        "~*validation server" 0;
        "~*mozilla" 1;
}

server {
        listen 80 default_server;
        listen [::]:80 default_server;

        #root /var/www/html;
        root /mnt/mirror/tunasync;

        # Add index.php to the list if you are using PHP
        index index.html index.htm index.nginx-debian.html;

        server_name _;

        error_page 404 /404.html;

        fancyindex_header /fancy-index/before;
        fancyindex_footer /fancy-index/after;
        fancyindex_exact_size off;
        fancyindex_time_format "%d %b %Y %H:%M:%S +0000";
        fancyindex_name_length 256;
        js_path /mnt/mirror/tunasync/static/njs;
        js_import fancyIndexRender from /mnt/mirror/tunasync/static/njs/fancy_index.njs;

        location /fancy-index {
                internal;
                #root /srv/mirrors;

                subrequest_output_buffer_size 100k;
                location = /fancy-index/before {
                        js_content fancyIndexRender.fancyIndexBeforeRender;
                }
                location = /fancy-index/after {
                        js_content fancyIndexRender.fancyIndexAfterRender;
                }
        }

        location /static/tunasync.json {
                proxy_pass http://localhost:21020/jobs;
        }

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                fancyindex on;

                try_files $uri $uri/ =404;
        }
}

The fancyindex configuration is based on these two issues: #345 #397

In addition, /static/njs/fancy_index.njs and /static/njs/legacy_index.njs need to be patched as described in this comment.

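At this point the mirror site can perform its basic functions. As for the traffic and resource-usage graphs, this issue suggests they may be built on a Telegraf + InfluxDB + Grafana stack. That is beyond the scope of this article, so I will stop here.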

by ISCAS weilinfox