Hey,
> I have read the bup readme and the design article (well, some parts i
> just skimmed over. very entertaining btw)
> But I still don't get why one should or would use bup over rsnapshot
> or rsync.
bup provides both deduplication and compression.
Say you have 2 Debian servers. Both of them have a bunch of data in
common. (This can be a media library on many computers, your email or
whatever.)
Now you back up the first server. bup's magic happens: files are split
into chunks, and those are packed together in packfiles. Nothing there
should bother you. If you want to know the "magic" behind it, please read
the DESIGN document or ask for specific explanations.
When that has finished, you start a backup on the second server. Here
again files are split up. Duplicate chunks aren't saved a second time;
they are just referenced. This is what deduplication is.
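The chunk-and-reference idea described above can be sketched in a few lines of Python. This is only an illustration, not bup's code: the windowed byte sum stands in for bup's real rolling checksum, and `WINDOW`, `BOUNDARY_MASK`, `split_chunks` and `ChunkStore` are hypothetical names and values chosen for the example.

```python
import hashlib
import random

WINDOW = 16           # bytes looked at to decide on a boundary (assumed value)
BOUNDARY_MASK = 0x3F  # roughly one boundary per 64 bytes (assumed value)


def split_chunks(data):
    """Cut data wherever a checksum of the trailing WINDOW bytes hits a pattern.

    Boundaries depend only on nearby content, so identical data yields the
    same chunks even when it sits at a different offset in another file.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        window_sum = sum(data[end - WINDOW:end])  # stand-in for a rolling checksum
        if window_sum & BOUNDARY_MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Toy repository: every distinct chunk is stored once, files are hash lists."""

    def __init__(self):
        self.chunks = {}  # sha1 hex digest -> chunk bytes
        self.files = {}   # file name -> list of digests

    def save(self, name, data):
        """Store data under name and return how many new bytes had to be kept."""
        new_bytes, refs = 0, []
        for chunk in split_chunks(data):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            refs.append(digest)
        self.files[name] = refs
        return new_bytes

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])


# Two "servers" that share most of their data.
shared = random.Random(0).getrandbits(8 * 20000).to_bytes(20000, "little")
server1 = shared + b"only on server 1"
server2 = b"only on server 2" + shared

store = ChunkStore()
first = store.save("server1/data", server1)
second = store.save("server2/data", server2)
print("new bytes stored for server 1:", first)
print("new bytes stored for server 2:", second)
```

The second backup should only add the few chunks that touch the differing bytes. Saving the same content under a new name adds no chunks at all, which is why a rename costs almost nothing in this model.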
> Is this more a proof-of-concept kind of thing "because git is so cool"
> or are there actually benefits over rsnapshot?
> One low hanging fruit with rsnapshot is disk space usage: as soon as a
> file is renamed or has even only 1 bit difference, it's stored again.
> I'm aware of that. But in terms of transmission efficiency (network
> traffic and time consumption), does bup offer anything that rsync
> doesn't? (I'm speaking for the sole use case of doing regular backups
> of VM images btw)
bup uses an algorithm similar to the one that rsync (and therefore
rsnapshot) uses for efficient transmissions. bup uses it for deduplication.
VM images are where bup shines, and more or less what it was designed
for: you have huge files that change a little.
I'm sorry, but I don't have too much time at the moment; please skim
through the DESIGN document to read what bup does.
> FWIW 2: I wrote an (unfinished) rsync benchmarking tool[*], with one
> of the testruns measuring the efficiency (in bytes sent, and time
> taken) of rsyncing VM image snapshots.
> You can easily hack/extend it for more use cases or even different
> synchronisation backends, so pull requests welcome ;-)
I'll have a look at your benchmarking tool ASAP to see how we can
possibly use it to benchmark bup as well.
I did some "benchmarks" some time ago [1]. I imported my rsnapshot
backups to bup.
TL;DR:
rsnapshot: 12.6G
bup: 4.6G
> Dieter
Thanks for your interest!
Zoran
>
> [*] http://dieter.plaetinck.be/rsyncbench_an_rsync_benchmarking_tool
> But my main concern is transmission traffic and duration; which I guess
> will be comparable to rsync then.
Except in the case of file renames/moves. Bup
1. saves backup space,
2. saves transfer time, and
3. keeps old snapshots as well.
With rsync transfer time is greater, and backup space is greater unless old
snapshots are removed.
If you tell me the parameters for a benchmark I'll be happy to do it.
Just my quick thoughts:
Two servers get backed up to another one. I'll do it with Amazon EC2
instances. I'll measure the traffic using NIC counters.
I'll do 3 runs:
* an initial backup
* a second without changed data
* a third with some added data. I'll try to mimic a changing VM image by
  generating a 1G text file and changing some lines somewhere in the middle.
I'll do all of this for two servers (with different fake VM images) for
both bup and rsnapshot.
I'll measure:
* backup time
* transferred data
* backup space
Any thoughts on that?
Zoran
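For anyone who wants to reproduce the "big file that changes a little" step, here is a minimal Python sketch. It is not the set of commands used in this thread. `make_fake_image` and `overwrite_block` are hypothetical helpers, and the tiny sizes in the demo are only there to keep it quick; a 1 MiB `block_size` with 1024 blocks would give a 1G file.

```python
import os
import tempfile


def make_fake_image(path, blocks, block_size=1024 * 1024):
    """Write `blocks` blocks of random bytes to path, like a fake VM image."""
    with open(path, "wb") as image:
        for _ in range(blocks):
            image.write(os.urandom(block_size))


def overwrite_block(path, block_index, block_size=1024 * 1024):
    """Replace one block in place with fresh random bytes; the size stays the same."""
    offset = block_index * block_size
    if offset + block_size > os.path.getsize(path):
        raise ValueError("block lies outside the file")
    with open(path, "r+b") as image:
        image.seek(offset)
        image.write(os.urandom(block_size))


# Small demo: an 8-block image of 4 KiB blocks, with block 3 changed.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "fake-image")
    make_fake_image(demo_path, blocks=8, block_size=4096)
    overwrite_block(demo_path, block_index=3, block_size=4096)
    print("image size after change:", os.path.getsize(demo_path))
```

Opening the file with `"r+b"` matters here: it changes bytes in the middle without truncating or rewriting the rest, which is what a running VM does to its disk image.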
All right, I'll do so.
>> a third with some added data. I'll try to mimic a changing VM Image by
>> generating a 1G textfile and changing some lines somewhere in the middle.
>>
>
> You could also use your real images (if you can get them out of bup, that is
> :)
My upstream connection isn't too good, so uploading some VM images isn't
an option.
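On Sat, Oct 16, 2010 at 8:00 AM, Dieter Plaetinck
<dieterp...@gmail.com> wrote:
> Nic counters don't seem 100% accurate.
> Look at my rsyncbench tool, where I use tcpdump to match the exact traffic
What do you mean "not 100% accurate"? There's no reason I can imagine
that the counters on eth0/eth1/etc. should be anything but accurate.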
The down side is that you'll get things like TCP headers and
retransmits thrown into your count, as well as traffic on any other
ports you use at the time. You might argue that those are in fact
*more* accurate than not including them, since if (say) bup sent a
whole bunch of one-byte packets, you're technically paying for more
TCP headers. Of course, neither bup nor rsnapshot nor rsync do
anything like that so it doesn't matter.
Beware that tcpdump can also drop packets sometimes, though.
When I'm testing stuff, I usually use iptables accounting rules.
iptables -A OUTPUT -p tcp --dport 22 -j ACCEPT
...
iptables -nvL OUTPUT
The rule you added should then have counters for how many bytes
matched that rule. (Note: I haven't tested the above commands, so
there might be a typo or two.)
Have fun,
Avery
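As a rough illustration of the NIC-counter approach discussed above, the following Python sketch reads byte counters in the format of Linux's /proc/net/dev and subtracts a "before" snapshot from an "after" snapshot. `parse_net_dev` and `traffic_delta` are hypothetical helpers, and the sample text is made up. On a real Linux host you would pass in the contents of /proc/net/dev taken before and after the backup. As noted above, the counters include TCP headers, retransmits and any unrelated traffic on the interface.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev-style text."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 9:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def traffic_delta(before, after, interface):
    """Bytes received and sent on interface between two snapshots."""
    rx_before, tx_before = before[interface]
    rx_after, tx_after = after[interface]
    return rx_after - rx_before, tx_after - tx_before


# Made-up snapshots standing in for two reads of /proc/net/dev.
SAMPLE_BEFORE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000     400    0    0    0     0          0         0   200000     300    0    0    0     0       0          0
"""
SAMPLE_AFTER = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  520000     450    0    0    0     0          0         0  2600000    2100    0    0    0     0       0          0
"""

received, sent = traffic_delta(
    parse_net_dev(SAMPLE_BEFORE), parse_net_dev(SAMPLE_AFTER), "eth0"
)
print("eth0 received bytes:", received)
print("eth0 sent bytes:", sent)
```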
Ok, I did my benchmark.
# 1. run - first backup
## bup
time: 219 s
transferred: 165417,61 KB
disk space: 156804 KB
## rsnapshot
time: 60 s
transferred: 888294,84 KB
disk space: 953708 KB
# 2. run
## bup
time: 0 s
transferred: 21,65 KB
disk space: 156820 KB
## rsnapshot
time: 4 s
transferred: 1270,8 KB
disk space: 968784 KB
# 3. run - generated a 1G fake-image with data from /dev/urandom on each
## bup
time: 691 s
transferred: 2208538,07 KB
disk space: 2277004 KB
## rsnapshot
time: 133 s
transferred: 2190208,71 KB
disk space: 3083116 KB
# 4. run
## bup
time: 12 s
transferred: 21,99 KB
disk space: 2281276 KB
## rsnapshot
time: 4 s
transferred: 1239,87 KB
disk space: 3098200 KB
# 5. run - changed 1M in the middle of the fake image file
## bup
time: 106 s
transferred: 2249,64 KB
disk space: 2281920 KB
## rsnapshot
time: 119 s
transferred: 3696,62 KB
disk space: 5212580 KB
# total
## bup
time: 1037 s
transferred: 2376248,96 KB
disk space: 2281920 KB
## rsnapshot
time: 320 s
transferred: 3084710,84 KB
disk space: 5212580 KB
If anyone has questions feel free to ask.
Zoran
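To make the totals easier to compare, here is a small Python sketch that only does arithmetic on the figures posted above (decimal commas written as points). The names `totals`, `ratio` and `run5_growth_kb` are just for this example. From these numbers rsnapshot ends up with roughly 2.3 times the disk space and 1.3 times the traffic of bup, while bup takes roughly 3.2 times as long.

```python
# Totals copied from the benchmark above.
totals = {
    "bup": {"time_s": 1037, "transferred_kb": 2376248.96, "disk_kb": 2281920},
    "rsnapshot": {"time_s": 320, "transferred_kb": 3084710.84, "disk_kb": 5212580},
}

# Disk usage after run 4 and after run 5 (1M changed in each fake image).
run4_disk_kb = {"bup": 2281276, "rsnapshot": 3098200}
run5_disk_kb = {"bup": 2281920, "rsnapshot": 5212580}


def ratio(metric):
    """How many times larger the rsnapshot figure is than the bup figure."""
    return totals["rsnapshot"][metric] / totals["bup"][metric]


# Extra disk space each tool needed to record the 1M change in run 5.
run5_growth_kb = {
    tool: run5_disk_kb[tool] - run4_disk_kb[tool] for tool in run5_disk_kb
}

print("disk space, rsnapshot / bup: %.2f" % ratio("disk_kb"))
print("transferred, rsnapshot / bup: %.2f" % ratio("transferred_kb"))
print("time, bup / rsnapshot: %.2f" % (1 / ratio("time_s")))
print("run 5 disk growth in KB:", run5_growth_kb)
```

The run 5 growth is the interesting part: bup adds well under a megabyte, while rsnapshot adds about 2G because it stores both changed 1G files again.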
Man, bup is slow. Definitely need to work on that :)
Thanks for all your work!
Have fun,
Avery
I suppose that this does not really show bup's advantage when backing up
multiple similar systems and moving/renaming big files or directories between
snapshots.
Well, two instances of the same image are pretty much similar systems.
Sure, I could have copied and/or moved my fake images around, or
transferred them between the servers, but I think it's pretty clear where
bup shines. Take a look at the rest of the "benchmark".
I think it looks pretty good disk-space- and transfer-wise.
What would your improvements be?
Zoran
On the other hand, by the time you've obtained a 9x performance
increase, adding file renames into that might be just bragging :)
Have fun,
Avery
I wrote an import-rsnapshot command, which just needs some test cases
before I submit the patches. I pushed it to my github repo:
http://github.com/zoranzaric/bup/tree/import-rsnapshot
Feel free to test and use it. It should make the transition from
rsnapshot to bup pretty easy.
Keep in mind that bup's master (and the import-rsnapshot branch) don't
save metadata like permissions, yet.
Rob is working on it and it'll be great.
Zoran
Exactly.
Zoran
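On Mon, Oct 18, 2010 at 3:15 PM, Dieter Plaetinck
<dieterp...@gmail.com> wrote:
> Clearly git is not able to do any special dedup (which is not surprising,
> since the image is 100% random, I guess real VM images have some room for
> dedup) or compression (which surprises me a bit, I thought git stores blobs
> compressed)
You simply can't compress randomness; try it sometime. If you could
compress it, it wouldn't be random.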
> Then you do a "change the 511. 1M block" (what does that mean?)
He listed the commands; it's obvious from those what he means.
> So, what happens here, 2 images have 1MB changed, so ~2MB transfer is
> needed, which bup does nicely and rsync is a bit less efficient, and then
> rsnapshot needs to save the full 1GB files again, causing the big 2GB
> storage increase.
>
> It isn't really the benchmark I was expecting, but the numbers are clear
> enough. I want to see some more bytes-transferred numbers for real VM
> images though, maybe I find the time to do that myself, sometime.
The trends persist across any sort of files. Other than the
improvements over rsnapshot that come from renaming, of course, since
presumably you aren't renaming your VM images. As far as bup is
concerned, renames are only the change of a few bytes (the filename)
and not anything else.
> How suitable is bup for real-life backups of VM images? (Since I don't care
> about file metadata, and run it only on Linux, I should be pretty safe,
> right?) I see the DESIGN document claims there is no "bup restore" but the
> README even uses "bup restore" in the examples. It looks that by now,
> restoring works properly, right?
To paraphrase: Zoran backed up the contents of his VM images, while
you plan to back up the raw VM disk files using your host system.
I'm not quite sure why you don't just test it on your own data; it
will only take a few minutes to set up, and then you'll have the final
answer on *your* data, not a synthetic benchmark.
But anyway, I expect that you'll find the results quite excellent. I
certainly have when I've backed up my VM images. VM disks tend to
have a lot of duplication outside the gzip compression window (about
32-128k) because when you copy a file around and then delete the
original, you end up with chunks of the same file in two totally
different places on the disk. gzip fails badly at compressing such
things - it only really fixes redundancy inside that small window -
and so bup usually compresses better than gzip on even a *single* copy
of a VM disk image, if that VM disk has been busy in the past.
rsnapshot just stores (I think gzipped?) copies of the VM disk, so the
disk space usage of bup should be much less.
As for file transfer, I'd expect bup and rsnapshot to be pretty close
to each other - for the first backup. After that, bup retains
persistent state on the client side (the .idx and .midx files) which
will allow it to do future backups while sending far fewer bytes. And
of course you won't have to store the entire file over again like you
would with rsnapshot.
Basically: just try it. It's better.
(Except for speed. It's getting to be time to rewrite more of bup in
C, I guess. :))
Have fun,
Avery
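Both points in the message above (random data does not compress, and gzip only sees repetition inside its small window) are easy to try with Python's zlib module, which uses the same DEFLATE format as gzip with a 32 KiB window. This is a sketch with made-up sizes. `random_bytes` and `compressed_ratio` are hypothetical helpers, and the 256 KiB distance is chosen only to be safely larger than the window.

```python
import random
import zlib


def random_bytes(size, seed):
    """Reproducible pseudo-random bytes, standing in for /dev/urandom data."""
    return random.Random(seed).getrandbits(8 * size).to_bytes(size, "little")


def compressed_ratio(data):
    """Compressed size divided by original size (1.0 means no gain at all)."""
    return len(zlib.compress(data, 9)) / len(data)


near_block = random_bytes(8 * 1024, seed=1)    # copies end up 8 KiB apart
far_block = random_bytes(256 * 1024, seed=2)   # copies end up 256 KiB apart

random_ratio = compressed_ratio(far_block)     # plain random data
near_ratio = compressed_ratio(near_block * 2)  # duplicate inside the window
far_ratio = compressed_ratio(far_block * 2)    # duplicate outside the window

print("random data:               %.3f" % random_ratio)
print("duplicate 8 KiB apart:     %.3f" % near_ratio)
print("duplicate 256 KiB apart:   %.3f" % far_ratio)
```

The nearby duplicate should compress to about half, while the distant duplicate should not shrink at all even though half of the input is redundant. A chunk store that indexes chunks by hash finds that kind of repetition at any distance, which is the advantage described above for busy VM disks.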
Yeah, bup should work fine and restoring works ;)
I'm sorry if something I wrote wasn't clear. I'm not a native speaker,
so please excuse my inaccuracies.
Zoran
> (Except for speed. It's getting to be time to rewrite more of bup in
> C, I guess. :))
Starting with a pack indexer? :D
I'm limping along manually indexing packs on my big backup, which is
mostly OK, since I'm not making really big changes there, just a pack or
two now and then.
-Z
As a matter of fact, the pack indexer would probably be plenty fast
even in python :) But you're right, it does need to be written. I
happen to be on a road trip right now that's interfering with such
things, but I promise it's on my list.
Or if someone else around here is feeling motivated, it's not actually
too hard. Just look for the place where we call 'git index-pack' and
replace it :)
Have fun,
Avery
In my case I was backing up a remote $HOME to a low powered NAS. Hard
link trees are simply not an appropriate solution. I think by the time
I gave up on rsnapshot, I had more space used in filesystem overhead
than files themselves.