makezip improvements #366

Open
kieranjol opened this issue Nov 1, 2019 · 6 comments

Comments

@kieranjol
Owner

Now that we know more about zips, there's much room for improvement:

  • It seems that LZMA compression at level 3 takes the same amount of time as uncompressed (probably due to overheads and USB3 bottlenecks) with a compression ratio of about 62%. A higher LZMA level (such as 5) has about 6% better compression but takes about 50% more time.
  • File splitting needs to be reduced from 500 GB to something more like 100 GB, so that LTO tapes can be filled more completely.
  • Looks like the logging of makezip within sipcreator has repetition and some minor errors (like two verifications, but saying status=finished).
  • It's worth investigating if 7za actually does the Test at the end of the zipping process. There's a very long stall at the end of a large zipping process, perhaps this is a fixity check, or maybe it is generating the hashes? It's worth figuring this out by intentionally damaging files with a hex editor and also trying to analyse the source code.
  • The mediatrace/mediainfo and dfxml outputs all seem to hang around outside the SIP folder - these should be deleted.
  • the extra manifest files in the logs need to be deleted too.
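
The damage test suggested in the list above can be rehearsed on a small scale before trying it on a real 7za archive. The following is a minimal sketch that uses Python's standard `zipfile` module as a stand-in for 7za, so it says nothing about what 7za itself does at the end of a run. The member name and payload are made up for illustration. It builds a stored (uncompressed) zip in memory, inverts one byte of the payload as a hex editor would, and checks that a CRC-32 test flags the damaged member.

```python
import io
import zipfile


def make_stored_zip(name: str, payload: bytes) -> bytes:
    """Build an in-memory zip holding one uncompressed (stored) member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def flip_byte(data: bytes, offset: int) -> bytes:
    """Return a copy of data with one byte inverted, like a hex-editor edit."""
    damaged = bytearray(data)
    damaged[offset] ^= 0xFF
    return bytes(damaged)


def first_bad_member(zip_bytes: bytes):
    """Return the name of the first member failing its CRC-32 check, or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.testzip()


# Hypothetical member name and payload, for illustration only.
payload = b"frame data " * 1000
intact = make_stored_zip("reel2/frame_000001.tif", payload)

# Damage a byte inside the stored payload rather than the zip headers.
damaged = flip_byte(intact, intact.index(payload) + 100)

print(first_bad_member(intact))
print(first_bad_member(damaged))
```

The equivalent manual check on a real archive would be to damage a copy of the zip and then run 7za's test command (`7za t archive.zip`) against it.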
@kieranjol
Owner Author

@stephenmcconnachie curious to know about any compression experiments ye did. I'm going to write a blog post on my benchmarking. Using a Mac Pro with 12 threads, a USB3 source, a Pegasus Thunderbolt RAID destination and a 1.6 TB 2K DCDM as input, compressing with LZMA level 3 takes a similar amount of time (about 16 hours) to zipping uncompressed. Maybe this is because of the overhead of USB3 and 150K files. I usually get a flurry of files being zipped, then it stalls, possibly to gather the CRC32 checksums?

Here are some benchmarks with reel 2 of a DCDM (no easily compressed opening credits).
I found that when I tested with the whole DCDM, the compression ratios for LZMA were much better (14% better) than the single reel values here - probably because of more instances of easily compressible data like closing and opening credits, fades to black etc.

I'm running further tests now on the speeds for uncompressed whole zips just to see how similar it is.


compression_type | compression_level | duration (h:mm:ss) | source_folder_size (bytes) | zip_file_size (bytes) | compression_ratio
Copy | 1 | 0:06:57.940210 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 3 | 0:06:27.893932 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 5 | 0:05:59.902307 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 7 | 0:05:05.365766 | 24450057988 | 24450600882 | 1.00002220420092
Deflate | 1 | 0:05:22.166162 | 24450057988 | 21961904137 | 0.898235257674187
Deflate | 3 | 0:04:42.248375 | 24450057988 | 21961904137 | 0.898235257674187
Deflate | 5 | 0:05:03.380526 | 24450057988 | 21831391361 | 0.892897324485478
Deflate | 7 | 0:08:05.586272 | 24450057988 | 21761943237 | 0.890056917152535
Deflate64 | 1 | 0:05:17.997120 | 24450057988 | 21564308133 | 0.881973700986054
Deflate64 | 3 | 0:05:03.372004 | 24450057988 | 21564308133 | 0.881973700986054
Deflate64 | 5 | 0:05:09.055852 | 24450057988 | 21393258699 | 0.874977830706975
Deflate64 | 7 | 0:09:01.028490 | 24450057988 | 21346412968 | 0.873061854433095
BZip2 | 1 | 0:13:45.623366 | 24450057988 | 20338580411 | 0.831841806714
BZip2 | 3 | 0:15:25.845792 | 24450057988 | 18796070689 | 0.768753624151936
BZip2 | 5 | 0:14:04.008644 | 24450057988 | 18337629336 | 0.750003511034618
BZip2 | 7 | 0:46:38.657094 | 24450057988 | 18337511965 | 0.749998710596105
LZMA | 1 | 0:13:59.295197 | 24450057988 | 20206549136 | 0.82644176737402
LZMA | 3 | 0:12:38.996991 | 24450057988 | 18621387158 | 0.76160912040124
LZMA | 5 | 0:18:08.816210 | 24450057988 | 16949567062 | 0.693232182529742
LZMA | 7 | 0:17:11.254498 | 24450057988 | 16953373782 | 0.693387876230014
PPMd | 1 | 0:16:43.967409 | 24450057988 | 20482188787 | 0.837715345994377
PPMd | 3 | 0:18:22.252274 | 24450057988 | 19155556252 | 0.783456475293493
PPMd | 5 | 0:20:37.762131 | 24450057988 | 17747657942 | 0.725873858896796
PPMd | 7 | 0:30:14.558052 | 24450057988 | 16221178580 | 0.663441313225568
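
For anyone reworking these figures, here is a small sketch of how the last column relates to the others. It assumes that the compression ratio is the zip size divided by the source folder size (which matches the values in the table) and that durations are written as h:mm:ss. The function names are illustrative and are not taken from makezip.

```python
def duration_to_seconds(text: str) -> float:
    """Convert an h:mm:ss.ffffff duration string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def compression_ratio(source_bytes: int, zip_bytes: int) -> float:
    """Zip size as a fraction of the source size (lower means smaller zip)."""
    return zip_bytes / source_bytes


SOURCE_BYTES = 24450057988

# (duration, zip size in bytes) copied from the LZMA level 3 and 5 rows.
lzma3 = ("0:12:38.996991", 18621387158)
lzma5 = ("0:18:08.816210", 16949567062)

ratio3 = compression_ratio(SOURCE_BYTES, lzma3[1])
ratio5 = compression_ratio(SOURCE_BYTES, lzma5[1])
slowdown = duration_to_seconds(lzma5[0]) / duration_to_seconds(lzma3[0])

print(f"LZMA 3 ratio: {ratio3:.4f}")
print(f"LZMA 5 ratio: {ratio5:.4f}")
print(f"LZMA 5 takes {slowdown:.2f}x as long as LZMA 3")
```

On this single reel, level 5 produces a zip roughly seven percentage points smaller than level 3 while taking a little over 1.4 times as long.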

@kieranjol
Owner Author

To clarify - I would have expected more of a gap in zipping time between lzma level 3 and uncompressed - it's the uncompressed that does the stalling after about 800 megs or so are processed (probably some I/O overhead and bottlenecks due to image sequence silliness)

@stephenmcconnachie

Hi Kieran, we've never actually used any compression method for our DCDMs (or DCP or DPX either) - we use TAR only, not ZIP. I'm interested in your tests for ZIP modelling though, for DCDM, as obviously the data storage overheads are huge...

@kieranjol
Owner Author

kieranjol commented Nov 4, 2019

Yeah, we defo don't do it for DCP. For large DCDMs, I keep seeing that the uncompressed 7za processes take longer than Deflate and even LZMA level 3. So approx 60% space saving for even less time than uncompressed. I think it all depends on the I/O, and the >100k small files are definitely a factor here, but we are moving ahead with LZMA level 3 for DCDM now anyhow - until we figure out how to get rawcooked to play nice with the folder structure of DCDM - and the confusion that will arise from having audio stems.

@stephenmcconnachie

stephenmcconnachie commented Nov 4, 2019 via email

@kieranjol
Owner Author

Most of these have been added. I need to add a -uncompressed option to both sipcreator and makezip, as there are some DCDMs that have terrible compression ratios and the maxed out CPUs and added verification time are simply not worth the hassle. It's totally worth it for the times when we get significant reductions in size though.
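
As a rough idea of what such an option could look like, here is a hypothetical sketch and not the actual makezip or sipcreator code. It only builds the 7za argument list and does not run it. The function names, argument names and defaults are assumptions. The 7za switches used are `-tzip` (zip container), `-mm=` (method: Copy or LZMA), `-mx=` (level) and `-v` (volume size, here the 100 GB split mentioned earlier in this issue).

```python
import argparse


def build_7za_command(source, zip_path, uncompressed=False, level=3, volume_gb=100):
    """Return a 7za argument list; all names and defaults are illustrative."""
    method = "Copy" if uncompressed else "LZMA"
    command = ["7za", "a", "-tzip", f"-mm={method}"]
    if not uncompressed:
        # The compression level only matters when data is actually compressed.
        command.append(f"-mx={level}")
    # Split the output into volumes so each piece fits neatly on tape.
    command.append(f"-v{volume_gb}g")
    command += [zip_path, source]
    return command


def parse_args(argv=None):
    """Parse a hypothetical command line with an -uncompressed switch."""
    parser = argparse.ArgumentParser(description="Sketch of a zip wrapper")
    parser.add_argument("source", help="folder to be zipped")
    parser.add_argument("zip_path", help="output zip file")
    parser.add_argument(
        "-uncompressed",
        action="store_true",
        help="store files without compression (7za Copy method)",
    )
    return parser.parse_args(argv)


# Example with made-up paths; the command is printed, not executed.
args = parse_args(["-uncompressed", "dcdm_folder", "dcdm.zip"])
command = build_7za_command(args.source, args.zip_path, args.uncompressed)
print(" ".join(command))
```

Keeping the command construction in its own function would let both scripts share it and makes the choice between Copy and LZMA easy to test without touching any files.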
