TVL depot development (mail to depot@tvl.su)
From: Askar Safin <safinaskar@gmail•com>
To: depot@tvl.su
Subject: here is how to write efficient castore
Date: Wed, 22 May 2024 15:52:09 +0300
Message-ID: <CAPnZJGB7d+O02TLWMHRcqqR+7QsC=8iup2Zo5pSMa2cFc4q2Kg@mail.gmail.com>

Hi. I spent a lot of time writing my own simple (but very efficient)
castore in Rust. It is parallel, it deduplicates data using fixed-size
chunks (not CDC), and it compresses. And it is only 200 lines of Rust
code! It has many important optimizations that are absent from
alternative castores. Consider taking ideas from my castore and
comparing its efficiency with that of yours. Here is a link to a GitHub
comment with my castore implementation (small and self-contained):
https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985
(it is called azwyon). The linked version of the code is slightly old:
it uses "rayon" instead of "pariter" (which the current version of
azwyon uses), but this doesn't affect the performance of the code, so
don't worry. If you need the current version of the code, I can send it.

Here are benchmarks that compare azwyon to many alternative castores:
https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394

Also see the whole thread (
https://github.com/borgbackup/borg/issues/7674 ) for discussion of
various optimizations made in azwyon.

Also see this summary, which describes what is wrong with existing castores:
https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl

Assume my code is public domain.

Also, I want to note that I'm slightly cheating: I use fixed-size
chunking instead of CDC. This is exactly what I need for my problem
domain (i.e. storing VM images). Your problem domain probably needs
CDC. So, if you want to take my code, you will probably have to
switch it to CDC. This will decrease its speed, but will deduplicate
better and so store less data.
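
To illustrate the trade-off above, here is a minimal sketch (not the
azwyon code) of a deduplicating store with fixed-size chunking. The
names Store, chunk_id and CHUNK_SIZE are made up for the example, the
chunk size is tiny for readability, and std's DefaultHasher stands in
for a cryptographic hash such as blake3. It shows why fixed-size chunks
suit in-place overwrites (as in VM images) and why an insertion, which
shifts every chunk boundary, defeats deduplication:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical chunk size; a real store would use a much larger one.
const CHUNK_SIZE: usize = 4;

/// Stand-in for a cryptographic hash such as blake3 (assumption:
/// a 64-bit non-cryptographic hash is NOT safe for a real castore).
fn chunk_id(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

/// Toy content-addressed store: chunk id -> chunk bytes.
#[derive(Default)]
struct Store {
    chunks: HashMap<u64, Vec<u8>>,
}

impl Store {
    /// Splits `data` into fixed-size chunks and stores the unseen ones.
    /// Returns the list of chunk ids and the number of newly stored chunks.
    fn put(&mut self, data: &[u8]) -> (Vec<u64>, usize) {
        let mut recipe = Vec::new();
        let mut new_chunks = 0;
        for chunk in data.chunks(CHUNK_SIZE) {
            let id = chunk_id(chunk);
            if !self.chunks.contains_key(&id) {
                self.chunks.insert(id, chunk.to_vec());
                new_chunks += 1;
            }
            recipe.push(id);
        }
        (recipe, new_chunks)
    }

    /// Reassembles the original data from a list of chunk ids.
    fn get(&self, recipe: &[u64]) -> Vec<u8> {
        let mut out = Vec::new();
        for id in recipe {
            out.extend_from_slice(&self.chunks[id]);
        }
        out
    }
}

fn main() {
    let mut store = Store::default();

    let (recipe, new_chunks) = store.put(b"AAAABBBBCCCCDDDD");
    assert_eq!(new_chunks, 4);
    assert_eq!(store.get(&recipe), b"AAAABBBBCCCCDDDD".to_vec());

    // In-place overwrite: only the touched chunk is new.
    let (_, new_chunks) = store.put(b"AAAABBBBXXXXDDDD");
    assert_eq!(new_chunks, 1);

    // One inserted byte shifts every boundary: nothing deduplicates.
    let (_, new_chunks) = store.put(b"ZAAAABBBBCCCCDDDD");
    assert_eq!(new_chunks, 5);
}
```

CDC picks chunk boundaries from the content itself, so the boundaries
move with the data after an insertion; that is the extra work (and the
extra deduplication) mentioned above.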

Okay, so how did I achieve this? Why is azwyon so efficient? Well,
here are the reasons for its success:

- I used Rust as opposed to slower languages, such as Go
- I used Rust, and this enabled me to have "fearless concurrency"
- I used pariter for parallelization
(https://dpc.pw/posts/adding-parallelism-to-your-rust-iterators/), and
it is absolutely great (note: the version linked above uses rayon
instead of pariter, but don't worry, this doesn't really matter)
- I "cheated" by using fixed-size chunking instead of slower CDC
(fortunately, fixed-size chunking is exactly what I need for my
particular kind of data)
- I used zstd as opposed to less efficient compression methods
- I used blake3 as opposed to slower hash algorithms
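
The parallel part of the list above can be sketched with the standard
library alone. This is only an illustration under stated assumptions,
not the azwyon code: std::thread::scope stands in for a
parallel-iterator library such as pariter or rayon, DefaultHasher
stands in for blake3, compression is left out, and the names
parallel_chunk_ids and chunk_id are made up. The point is that
fixed-size chunks are independent of one another, so they can be hashed
on several threads and the results collected in input order:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

/// Stand-in for a fast cryptographic hash such as blake3 (assumption).
fn chunk_id(chunk: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    chunk.hash(&mut hasher);
    hasher.finish()
}

/// Hashes every fixed-size chunk of `data` on up to `workers` threads.
/// The returned ids are in the same order as the chunks in `data`.
fn parallel_chunk_ids(data: &[u8], chunk_size: usize, workers: usize) -> Vec<u64> {
    let chunks: Vec<&[u8]> = data.chunks(chunk_size).collect();
    if chunks.is_empty() {
        return Vec::new();
    }
    // Give each worker one contiguous batch of chunks (rounded up).
    let workers = workers.max(1);
    let per_worker = (chunks.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn all workers first, then join them in batch order.
        let handles: Vec<_> = chunks
            .chunks(per_worker)
            .map(|batch| {
                scope.spawn(move || {
                    batch.iter().map(|&c| chunk_id(c)).collect::<Vec<u64>>()
                })
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<u64>>()
    })
}

fn main() {
    let data: Vec<u8> = (0..1000u32).map(|i| (i % 251) as u8).collect();

    let sequential: Vec<u64> = data.chunks(16).map(chunk_id).collect();
    let parallel = parallel_chunk_ids(&data, 16, 4);

    // Same ids, same order, regardless of how many threads did the work.
    assert_eq!(parallel, sequential);
    println!("hashed {} chunks", parallel.len());
}
```

A real pipeline would also compress each new chunk (for example with
zstd) inside the worker, since that step is independent per chunk too.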

-- 
Askar Safin
