bag.file_existence_manager module

Tools for finding duplicate files.

class bag.file_existence_manager.FileExistenceManager(store, consider_bytes=0)[source]

Bases: object

Manages existing files through their hashcodes.

User code can:

  • add_or_replace_file()

  • check whether a file_exists()

  • combine the 2 previous operations with try_add_file()

When checking for existence, only file content is considered; file names are irrelevant.

add_or_replace_file(f, value)[source]

Whether the hash already exists does not matter.

Put the hash for file f in the store, associated with value, which is typically the path of f.

close()[source]

Release resources, especially the database file.

file_exists(f)[source]

Return the stored value if the hash for file object f exists.

Otherwise return None.

try_add_file(f, value)[source]

Return existing value or add hash with passed value.

If the hash for file f already exists, just return the associated value. If not, add the hash with the provided value.
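The three operations can be illustrated with a small stand-in. This is only a sketch, not the bag implementation: the class name TinyExistenceManager, the plain dict used as the store, the choice of SHA-256, and returning None when a new hash is added are all assumptions made for illustration.

```python
import hashlib
import io


class TinyExistenceManager:
    """Hypothetical stand-in: maps content hashes to values in a dict."""

    def __init__(self):
        self.store = {}  # assumption: a dict stands in for the real store

    def _hash(self, f):
        # Assumption: SHA-256 over the whole content; the real hash may differ.
        f.seek(0)
        return hashlib.sha256(f.read()).hexdigest()

    def add_or_replace_file(self, f, value):
        # Overwrites any value already stored for this content.
        self.store[self._hash(f)] = value

    def file_exists(self, f):
        # Returns the stored value, or None if the content is unknown.
        return self.store.get(self._hash(f))

    def try_add_file(self, f, value):
        key = self._hash(f)
        if key in self.store:
            return self.store[key]
        self.store[key] = value
        return None  # assumption: nothing to report when the hash is new


manager = TinyExistenceManager()
first = io.BytesIO(b"same content")
copy = io.BytesIO(b"same content")
other = io.BytesIO(b"different content")

added = manager.try_add_file(first, "a.txt")   # hash is new, so it is stored
found = manager.try_add_file(copy, "b.txt")    # same content: existing value
missing = manager.file_exists(other)           # never stored
```

Because only content is hashed, the second call finds the value stored by the first even though the two objects are distinct and were given different names.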

class bag.file_existence_manager.GdbmStorageStrategy(path='./file_hashes.gdbm', mode='c', sync='s')[source]

Bases: object

Stores file hashes and file paths in a GNU DBM file.

close()[source]

class bag.file_existence_manager.KeepLarger(dups_dir=None)[source]

Bases: object

Move the smaller file to a “dups” subdirectory.

A callback that keeps the larger file of a duplicate pair in place.
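The behaviour described above might be sketched as a plain function. This is not the KeepLarger source: the function name keep_larger, the default location of the "dups" folder (beside the smaller file), and the handling of equal sizes are assumptions. The two test files are given different sizes purely to exercise the function.

```python
import shutil
import tempfile
from pathlib import Path


def keep_larger(original, dup, m=None, dups_dir=None):
    """Hypothetical callback: move the smaller of two files aside."""
    # Assumption: when sizes are equal, the second file (dup) is moved.
    smaller = original if original.stat().st_size < dup.stat().st_size else dup
    # Assumption: default to a "dups" folder beside the smaller file.
    target_dir = Path(dups_dir) if dups_dir else smaller.parent / "dups"
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / smaller.name
    shutil.move(str(smaller), str(destination))
    return destination


with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.bin"
    small = Path(tmp) / "small.bin"
    big.write_bytes(b"x" * 100)
    small.write_bytes(b"x" * 10)
    moved_to = keep_larger(big, small)
    big_kept = big.exists()
    small_moved = moved_to.exists() and not small.exists()
    moved_name = moved_to.parent.name
```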

property dups_dir

class bag.file_existence_manager.TransientStrategy[source]

Bases: object

Stores file hashes and paths in memory only.

close()[source]

bag.file_existence_manager.check_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)[source]

Check files in directory against the database path.

Example usage:

check_dups(directory='some/directory',
           callbacks=[print_dup, trash_dup])

bag.file_existence_manager.file_as_block_iter(afile, blocksize=65536)[source]

bag.file_existence_manager.find_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)[source]

Like check_dups(), but also updates the database as it goes.

Given a directory, goes through all files that pass the predicate filter and, for each one that is a duplicate, calls each of the callbacks. Returns a dictionary containing the duplicates found.

Example usage:

d = find_dups(directory='some/directory',
              callbacks=[print_dup, KeepLarger()])

The signature for writing callbacks is (original, dup, m), where original and dup are Path instances and m is the FileExistenceManager instance.
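A custom callback following that signature could look like the sketch below. The name report_sizes is hypothetical, and returning the message is only a convenience for the demonstration. Whether the library uses a callback's return value is not stated here. The manager argument is accepted but unused, so the demonstration passes None in its place.

```python
import tempfile
from pathlib import Path


def report_sizes(original, dup, m):
    """Hypothetical callback: describe a duplicate pair without touching it."""
    # original and dup are Path instances; m (the manager) is unused here.
    line = "{} ({} bytes) duplicates {} ({} bytes)".format(
        dup, dup.stat().st_size, original, original.stat().st_size)
    print(line)
    return line


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.txt"
    second = Path(tmp) / "second.txt"
    first.write_text("hello")
    second.write_text("hello")
    message = report_sizes(first, second, None)
```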

bag.file_existence_manager.hash_bytestr_iter(bytes_iter, hasher, as_hex_str=False)[source]

bag.file_existence_manager.populate_db(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)[source]

Create/update database at path by hashing files in directory.
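Hashing the files in a directory can be sketched as follows. This is a simplified illustration rather than the library's code: the helper names hash_file and build_index, the use of SHA-256, the recursive directory walk, and the dict standing in for the database are all assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path, blocksize=65536):
    """Return the hex digest of a file, read one block at a time."""
    hasher = hashlib.sha256()  # assumption: the real hash may differ
    with open(path, "rb") as stream:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: stream.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()


def build_index(directory):
    """Map each content hash to the first path found with that content."""
    index = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            index.setdefault(hash_file(path), str(path))
    return index


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_bytes(b"hello")
    (Path(tmp) / "b.txt").write_bytes(b"hello")   # same content as a.txt
    (Path(tmp) / "c.txt").write_bytes(b"world")
    index = build_index(tmp)
    names = sorted(Path(p).name for p in index.values())
```

Reading in fixed-size blocks keeps memory use flat regardless of file size, which matters when a directory contains large files.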

bag.file_existence_manager.print_dup(original, duplicate, m)[source]

A callback that just prints the duplicate pair.

bag.file_existence_manager.print_dup_unless_empty(original, duplicate, m)[source]

Print the duplicate pair unless the files are empty.

bag.file_existence_manager.trash_dup(original, duplicate, m)[source]

Callback that puts the duplicate file in the trash.

You need to install the Ubuntu package trash-cli.

bag.file_existence_manager.trash_dup_unless_empty(original, duplicate, m)[source]

Callback that puts the duplicate file in the trash, unless it is empty.

(Sometimes I use zero-length files to present information in their names.)