Md5sum For Mac Program
The md5sum program does not provide checksums for directories. I want to get a single MD5 checksum for the entire contents of a directory, including files in sub-directories. That is, one combined checksum made out of all the files. Is there a way to do this?
17 Answers
The right way depends on exactly why you're asking:
Option 1: Compare Data Only
If you just need a hash of the tree's file contents, this will do the trick:
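For example, something along these lines (somedir is a placeholder for your tree; -s is BSD find's sorted traversal):

# Hash every file, then hash the resulting list of "checksum  filename" lines
find -s somedir -type f -exec md5sum {} \; | md5sum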
This first summarizes all of the file contents individually, in a predictable order, then passes that list of file names and MD5 hashes to be hashed itself, giving a single value that only changes when the content of one of the files in the tree changes.
Unfortunately, find -s only works with BSD find(1), used in macOS, FreeBSD, NetBSD and OpenBSD. To get something comparable on a system with GNU or SUS find(1), you need something a bit uglier:
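A sketch of that uglier form (somedir again being a placeholder):

# Hash every file, sort the output by filename (field 2), then hash the sorted list
find somedir -type f -exec md5sum {} \; | sort -k 2 | md5sum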
We've replaced find -s with a call to sort. The -k 2 bit tells it to skip over the MD5 hash, so it only sorts the file names, which are in field 2 through end-of-line, by sort's reckoning.
There's a weakness with this version of the command, which is that it's liable to become confused if you have any filenames with newlines in them, because they'll look like multiple lines to the sort call. The find -s variant doesn't have that problem, because the tree traversal and sorting happen within the same program, find.
In either case, the sorting is necessary to avoid false positives: the most common Unix/Linux filesystems don't maintain the directory listings in a stable, predictable order. You might not realize this from using ls and such, which silently sort the directory contents for you. find without -s or a sort call is going to print out files in whatever order the underlying filesystem returns them, which will cause this command to give a changed hash value if the order of files given to it as input changes.
You might need to change the md5sum commands to md5 or some other hash function. If you choose another hash function and need the second form of the command for your system, you might need to adjust the sort command accordingly. Another trap is that some data summing programs don't write out a file name at all, a prime example being the old Unix sum program.
This method is somewhat inefficient, calling md5sum N+1 times, where N is the number of files in the tree, but that's a necessary cost to avoid hashing file and directory metadata.
Option 2: Compare Data and Metadata
If you need to be able to detect that anything in a tree has changed, not just file contents, ask tar to pack the directory contents up for you, then send it to md5sum:
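For example (somedir is a placeholder):

# Archive the tree to stdout, metadata and all, and hash the archive
tar -cf - somedir | md5sum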
Because tar also sees file permissions, ownership, etc., this will also detect changes to those things, not just changes to file contents.
This method is considerably faster, since it makes only one pass over the tree and runs the hash program only once.
As with the find-based method above, tar is going to process file names in the order the underlying filesystem returns them. It may well be that in your application, you can be sure you won't cause this to happen. I can think of at least three different usage patterns where that is likely to be the case. (I'm not going to list them, because we're getting into unspecified behavior territory. Each filesystem can be different here, even from one version of the OS to the next.)
If you find yourself getting false positives, I'd recommend going with the find | cpio option in Gilles' answer.
The checksum needs to be of a deterministic and unambiguous representation of the files as a string. Deterministic means that if you put the same files at the same locations, you'll get the same result. Unambiguous means that two different sets of files have different representations.
Data and metadata
Making an archive containing the files is a good start. This is an unambiguous representation (obviously, since you can recover the files by extracting the archive). It may include file metadata such as dates and ownership. However, this isn't quite right yet: an archive is ambiguous, because its representation depends on the order in which the files are stored and, if applicable, on the compression.
A solution is to sort the file names before archiving them. If your file names don't contain newlines, you can run find | sort to list them, and add them to the archive in this order. Take care to tell the archiver not to recurse into directories. Here are examples with POSIX pax, GNU tar and cpio:
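Sketches of the idea with each tool (assuming newline-free file names and the current directory as the root of the tree):

# POSIX pax: reads the sorted file list from stdin; -d keeps it from descending into directories
find . | LC_ALL=C sort | pax -w -d | md5sum

# GNU tar: --files-from=- reads the sorted list; --no-recursion keeps tar from descending
find . | LC_ALL=C sort | tar -cf - --no-recursion --files-from=- | md5sum

# cpio: reads the sorted file list from stdin and writes the archive to stdout
find . | LC_ALL=C sort | cpio -o | md5sum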
Names and contents only, the low-tech way
If you only want to take the file data into account and not metadata, you can make an archive that includes only the file contents, but there are no standard tools for that. Instead of including the file contents, you can include the hash of the files. If the file names contain no newlines, and there are only regular files and directories (no symbolic links or special files), this is fairly easy, but you do need to take care of a few things:
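One possible shape for such a pipeline, as a sketch (run from the root of the tree; assumes GNU md5sum and newline-free names):

( export LC_ALL=C                                  # reproducible, locale-independent sort order
  find . -type d | sort                            # directory listing, so empty directories count
  echo                                             # separates the two parts
  find . -type f -exec md5sum {} + | sort -k 2     # per-file checksums, sorted by file name
  find . -type f -exec wc -c {} \; | sort -k 2     # per-file sizes, against length-extension tricks
) | md5sum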
We include a directory listing in addition to the list of checksums, as otherwise empty directories would be invisible. The file list is sorted (in a specific, reproducible locale; thanks to Peter.O for reminding me of that). An echo separates the two parts (without this, you could make some empty directories whose names look like md5sum output, which could also pass for ordinary files). We also include a listing of file sizes, to avoid length-extension attacks.
By the way, MD5 is deprecated. If it's available, consider using SHA-2, or at least SHA-1.
Names and data, supporting newlines in names
Here is a variant of the code above that relies on GNU tools to separate the file names with null bytes. This allows file names to contain newlines. The GNU digest utilities quote special characters in their output, so there won't be ambiguous newlines.
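A sketch of such a variant (GNU find, sort, xargs and md5sum assumed):

( export LC_ALL=C
  find . -type d -print0 | sort -z | tr '\0' '\n'     # directory listing
  echo
  find . -type f -print0 | sort -z | xargs -0 md5sum  # GNU md5sum escapes awkward names itself
) | md5sum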
A more robust approach
Here's a minimally tested Python script that builds a hash describing a hierarchy of files. It takes directories and file contents into account, ignores symbolic links and other files, and returns a fatal error if any file can't be read.
Have a look at md5deep. Some of the features of md5deep that may interest you:
Recursive operation - md5deep is able to recursively examine an entire directory tree. That is, compute the MD5 for every file in a directory and for every file in every subdirectory.
Comparison mode - md5deep can accept a list of known hashes and compare them to a set of input files. The program can display either those input files that match the list of known hashes or those that do not match.
...
If your goal is just to find differences between two directories, consider using diff.
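For instance (dir1 and dir2 being the two trees to compare):

# Recursively compare the trees, reporting only which files differ
diff -rq dir1 dir2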
Try this:
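Something along these lines (the path is a placeholder):

# Hash each file, keep only the checksums, sort them, then hash the sorted list
find /path/to/dir -type f -exec md5sum {} + | awk '{print $1}' | sort | md5sum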
You can hash every file recursively and then hash the resulting text:
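For example, a sketch (-r recurses, -l makes the reported paths relative):

md5deep -r -l . | sort | md5sum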
md5deep is required.
File contents only, excluding filenames
I needed a version that ignored the filenames and checked only the contents, because the same files reside in different directories.
This version (Warren Young's answer) helped a lot, but my version of md5sum outputs the filename (relative to the path I ran the command from), and the folder names were different, so even though the individual file checksums matched, the final checksum didn't. To fix that, in my case, I just needed to strip the filename off each line of the find output (selecting only the first word, as separated by spaces, using cut):
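A sketch of that adjustment (the path is a placeholder):

# Keep only the checksum column before the final hash, so differing paths no longer matter
find /path/to/dir -type f -exec md5sum {} \; | cut -d" " -f1 | sort | md5sum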
Solution: use the checksumdir Python package. It works fast and is an easier solution than bash scripting.
See the docs: https://pypi.python.org/pypi/checksumdir/1.0.5
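Roughly (the dirhash call follows the package's documented API; the path is a placeholder):

pip install checksumdir
python -c "import checksumdir; print(checksumdir.dirhash('/path/to/directory', 'md5'))"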
nix-hash from the Nix package manager
The command nix-hash computes the cryptographic hash of the contents of each path and prints it on standard output. By default, it computes an MD5 hash, but other hash algorithms are available as well. The hash is printed in hexadecimal.
The hash is computed over a serialisation of each path: a dump of the file system tree rooted at the path. This allows directories and symlinks to be hashed as well as regular files. The dump is in the NAR format produced by nix-store --dump. Thus, nix-hash path yields the same cryptographic hash as nix-store --dump path | md5sum.
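For example, matching the manual excerpt above:

nix-hash /path/to/directory     # same result as: nix-store --dump /path/to/directory | md5sum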
I use this snippet of mine for moderate volumes:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 cat | md5sum -
and this one for XXXL:
find . -xdev -type f -print0 | LC_COLLATE=C sort -z | xargs -0 tail -qc100 | md5sum -
A good tree check-sum is the tree-id of Git.
There is unfortunately no stand-alone tool available which can do that (at least I don't know of one), but if you have Git handy you can just pretend to set up a new repository and add the files you want to check to the index.
This allows you to produce the (reproducible) tree hash - which includes only content, file names and some reduced file modes (executable).
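One way to do that, as a sketch (the temporary-repository handling here is just one option; any throwaway repository will do):

export GIT_DIR=$(mktemp -d)   # keep the throwaway repository outside the tree being hashed
export GIT_WORK_TREE=.        # treat the current directory as the work tree
git init --quiet
git add -A .                  # stage everything under the current directory
git write-tree                # prints the tree id: contents, names, executable bits
rm -rf "$GIT_DIR"
unset GIT_DIR GIT_WORK_TREE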
As a follow-up to this excellent answer, if you find yourself wanting to speed up the calculation of the checksum for a large directory, try GNU Parallel:
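A sketch of the idea (somedir is a placeholder; -s is BSD find's sorted traversal, and md5 is the macOS hashing tool mentioned below):

find -s somedir -type f | parallel -k -n 100 md5 | md5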
(This is using a Mac with md5, replace as needed.)
The -k flag is important; it instructs parallel to maintain order, otherwise the overall sum can change run to run even if the files are all the same. -n 100 says to run each instance of md5 with 100 arguments; this is a parameter you can tweak for best run time. See also the -X flag of parallel (though in my personal case that caused an error).
If you want a script which is well tested and supports a number of operations, including finding duplicates, doing comparisons on both data and metadata, and showing additions as well as changes and removals, you might like Fingerprint.
Fingerprint right now doesn't produce a single checksum for a directory, but a transcript file which includes checksums for all files in that directory.
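Assuming the usual invocation of the Fingerprint CLI (treat the exact subcommand as an assumption and check its documentation), running it from the directory to be scanned looks like:

fingerprint analyze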
This will generate index.fingerprint in the current directory, which includes checksums, filenames and file sizes. By default it uses both MD5 and SHA2.256.
In the future, I hope to add support for Merkle Trees into Fingerprint which will give you a single top-level checksum. Right now, you need to retain that file for doing verification.
I didn't want new executables or clunky solutions, so here's my take:
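A sketch in that spirit, using only coreutils and findutils (the -print0/-z null-delimited options are GNU extensions):

# Hash every file in a stable order, then hash the list of hashes
find . -type f -print0 | sort -z | xargs -0 md5sum | md5sum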
Or, doing it individually for all files in each directory:
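A hypothetical sketch of that per-directory variant (-maxdepth is a GNU/BSD find extension):

# One combined checksum per directory, covering the regular files directly inside it
find . -type d | sort | while read -r dir; do
  sum=$(find "$dir" -maxdepth 1 -type f -exec md5sum {} + | sort -k 2 | md5sum | cut -d" " -f1)
  printf '%s  %s\n' "$sum" "$dir"
done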
Migration to POSIX archive format affects GNU Tar based checksums
This answer is intended to be a supplementary update to the approach of using Tar output to hash the contents of directories, as it was proposed (among other things) in the excellent answers of Warren Young and Gilles some time ago.
Since then, at least openSUSE (since its release 12.2) changed its default GNU Tar format from 'GNU tar 1.13.x format' to the (slightly) superior 'POSIX 1003.1-2001 (pax) format'. Also upstream (among the developers of GNU Tar), the same migration is being discussed; see for example the last paragraph on this page of the GNU Tar manual:
The default format for GNU tar is defined at compilation time. You may check it by running tar --help, and examining the last lines of its output. Usually, GNU tar is configured to create archives in gnu format, however, future versions will switch to posix.
(This page also gives a nice review on the different archive formats that are available with GNU Tar.)
In our case, where we tar the directory contents and hash the result, and without taking specific measures, a change from GNU to POSIX format has the following consequences:
In spite of identical directory contents, the resulting checksum will be different.
In spite of identical directory contents, the resulting checksum will be different from run to run if the default pax headers are used.
The latter comes from the fact that the POSIX (pax) format includes extended pax headers which are determined by a format string that defaults to %d/PaxHeaders.%p/%f in GNU Tar. Within this string, the specifier %p is replaced by the process ID of the generating Tar process, which of course is different from run to run. See this section of the GNU Tar manual and in particular this one for details.
Just now, dating from 2019-03-28, there is a commit accepted upstream that defuses this issue.
So, to be able to continue using GNU Tar in the given use case, I can recommend the following alternative options:
- Use the Tar option --format=gnu to explicitly tell Tar to generate the archive in the 'old' format. This is mandatory to validate 'old' checksums.
- Use the newer POSIX format, but explicitly specify a suitable pax header, for example by --pax-option='exthdr.name=%d/PaxHeaders/%f'. However, this breaks backward compatibility with 'old' checksums.
Here is a Bash code fragment that I use on a regular basis to compute checksums of directory contents including metadata:
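A sketch of such a fragment, assembled from the options discussed below (<paths> is the placeholder explained right after):

(
  export LC_ALL=C
  find <paths> ! -type s -print0 |
    sort -z |
    tar --create --file=- --format=gnu \
        --null --files-from=- --no-recursion \
        --numeric-owner --atime-preserve |
    sha256sum
)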
Herein, <paths> is replaced by a space-separated list of the paths of all directories that I want to be covered by the checksum. The purpose of using the C locale, the null byte separation of filenames, and of using find and sort to get a filesystem-independent order of the files in the archive is already sufficiently discussed in other answers.
The surrounding parentheses keep the LC_ALL setting local in a subshell.
In addition, I use the expression ! -type s with find to avoid warnings from Tar that occur if socket files are part of the directory contents: GNU Tar does not archive sockets. If you prefer to be notified about skipped sockets, leave that expression out.
I use --numeric-owner with Tar, to be able to verify the checksums later even on systems where not all of the file owners are known.
The --atime-preserve option for Tar is better omitted if any of the <paths> lies on a read-only mounted device. Otherwise you will be warned for every single file whose access timestamp Tar was not able to restore. For write-enabled <paths>, I use this option, well, to preserve the access timestamps in the hashed directories.
The Tar option --no-recursion, which was already used in Gilles' proposal, prevents Tar from recursively descending into directories by itself, and makes it operate instead file by file on whatever it gets fed from the sorted find output.
And finally, it is not true that I use md5sum: I actually use sha256sum.
- First things first, don't hog the available memory! Hash a file in chunks rather than feeding the entire file.
- Different approaches for different needs/purposes (all of the below, or pick whatever applies):
- Hash only the entry name of all entries in the directory tree
- Hash the file contents of all entries (leaving out the metadata, like inode number, ctime, atime, mtime, size, etc.; you get the idea)
- For a symbolic link, its content is the referent name. Hash it or choose to skip it
- Follow or not follow (resolve the name) the symlink while hashing the contents of the entry
- If it's a directory, its contents are just directory entries. While traversing recursively, they will be hashed eventually, but should the directory entry names of that level be hashed to tag this directory? This is helpful in use cases where the hash is needed to identify a change quickly without having to traverse deeply to hash the contents. An example would be a file whose name changes while the rest of the contents remain the same and the files are all fairly large
- Handle large files well (again, mind the RAM)
- Handle very deep directory trees (mind the open file descriptors)
- Handle non-standard file names
- How to proceed with files that are sockets, pipes/FIFOs, block devices, or char devices? Should they be hashed as well?
- Don't update the access time of any entry while traversing, because that would be a side effect that is counter-productive (and counter-intuitive) for certain use cases.
This is what I have off the top of my head; anyone who has spent some time working on this in practice will have caught other gotchas and corner cases.
Here's a tool (disclaimer: I'm a contributor to it), dtreetrawl, which is very light on memory and addresses most of these cases. It might be a bit rough around the edges, but it has been quite helpful.
An example of its human-friendly output:
I would like to create an md5 checksum list for all files in a directory.
I want to cat filename | md5sum > output.txt. I want to do this in one step for all files in my directory.
Any assistance would be great.
4 Answers
You can pass md5sum multiple filenames or bash expansions:
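For example (output.txt follows the file name used in the question):

# One checksum line per file matched by the glob
md5sum * > output.txt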
If you want to get fancy you can use things like find to drill down and filter the files, as well as working recursively:
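For instance (the name filter is just an example):

# Recursively hash every .jpg under the current directory
find . -type f -name '*.jpg' -exec md5sum {} + > output.txt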
A great checksum creation/verification program is rhash. It even creates SFV-compatible files, and checks them too.
It supports md4, md5, sha1, sha512, crc32 and many others.
Moreover it can do recursive creation (-r option) like md5deep or sha1deep.
Last but not least you can format the output of the checksum file; for example:
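Something along these lines (treat the exact -p printf directives as an assumption and check rhash(1)):

rhash --md5 -r -p '%p,%{md5}\n' /home > checksums.csv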
This outputs a CSV file including the full paths of files, recursively, starting from the /home directory.
I also find the -e option, which renames files by inserting the crc32 sum into the name, extremely useful.
You can replace 'md5sum' with 'rhash' in PhoenixNL72's examples.
Here are two more extensive examples:
Create an md5 file in each directory which doesn't already have one, with absolute paths:
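A sketch of such a loop (using find to list the directories; the steps follow the description below):

find "$(pwd)" -type d | sort | while read -r dir; do
  if [ -e "$dir/@md5Sum.md5" ]; then
    echo "Skipped $dir"
  else
    echo "Processing $dir"
    md5sum "$dir"/* > "$dir/@md5Sum.md5" 2>/dev/null   # subdirectory entries only produce ignorable warnings
    chmod 444 "$dir/@md5Sum.md5"
  fi
done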
Create an md5 file in each folder which doesn't already have one: no paths, only filenames:
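And a sketch of the filenames-only variant, which changes into each directory so md5sum records bare file names:

find . -type d | sort | while read -r dir; do
  if [ -e "$dir/@md5Sum.md5" ]; then
    echo "Skipped $dir"
  else
    echo "Processing $dir"
    ( cd "$dir" && md5sum * > @md5Sum.md5 2>/dev/null && chmod 444 @md5Sum.md5 )
  fi
done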
What differs between 1 and 2 is the way the files are presented in the resulting md5 file.
The commands do the following:
- Build a list of directory names for the current folder. (Tree)
- Sort the folder list.
- Check in each directory if the file @md5Sum.md5 exists. Output 'Skipped' if it exists, output 'Processing' if it doesn't.
- If the @md5Sum.md5 file doesn't exist, md5sum will generate one with the checksums of all the files in the folder.
- Set the generated @md5Sum.md5 file to read-only.
The output of this entire script can be redirected to a file (.....; done > test.log) or piped to another program (like grep). The output will only tell you which directories were skipped and which have been processed.
After a successful run, you will end up with an @md5Sum.md5 file in each subdirectory of your current directory.
I named the file @md5Sum.md5 so it'll get listed at the top of the directory in a Samba share.
Verifying all @md5Sum.md5 files can be done by the next commands:
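A sketch of the verification pass (it runs each check from inside its own directory, so both variants above work):

find . -name '@md5Sum.md5' | sort | while read -r sumfile; do
  ( cd "$(dirname "$sumfile")" && md5sum -c @md5Sum.md5 )
done > checklog.txt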
Afterwards you can grep checklog.txt using grep -v OK to get a list of all files that differ.
To regenerate an @md5Sum.md5 in a specific directory, when you changed or added files for instance, either delete the @md5Sum.md5 file or rename it and run the generate command again.
I hit this issue, and while the solutions above are elegant, I wanted a quick and dirty hack for this situation: 1 directory, with subdirectories one level deep inside it.
So, enter the directory in a shell and run:
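Something like the following one-liner (redirect the output wherever you like):

# Top-level files plus the files one level down; "Is a directory" warnings go to /dev/null
md5sum * */* 2>/dev/null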
This gets all the files in the top-level directory, suppresses the error warning about the subdirectories being directories, and then runs md5sum on the subdirectory contents. Advantage: it's easy to remember and does exactly what it's supposed to do. I always get confused by find syntax and can never remember it off the top of my head, so with no need to loop or deal with spaces in directory names, this one-liner worked fine. It's not a robust, powerful solution, and no good for more than one level of subdirectories, but it's a quick and easy fix for the problem.