bag.file_existence_manager module
Tools for finding duplicate files.
- class bag.file_existence_manager.FileExistenceManager(store, consider_bytes=0)
- Bases: object
- Manages existing files through their hashcodes. User code can:
  - add_or_replace_file()
  - check whether a file_exists()
  - combine the 2 previous operations with try_add_file()
- When checking for existence, only file content is considered; file names are irrelevant.
- add_or_replace_file(f, value)
- Put the hash for file f in the store, associated with value, which is typically the path of f. Whether the hash already exists does not matter.
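The three operations listed above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: the class name ExistenceSketch, the helper hash_file, the choice of SHA-1, the plain dict used as the store, the return values, and the reading of consider_bytes as "hash only the first N bytes when positive" are all assumptions made for illustration.

```python
import hashlib


def hash_file(path, consider_bytes=0, chunk_size=65536):
    """Return a SHA-1 hex digest of the file content (hypothetical helper).

    Assumption: a positive consider_bytes limits hashing to that many
    leading bytes; 0 hashes the whole file.
    """
    digest = hashlib.sha1()
    remaining = consider_bytes if consider_bytes > 0 else None
    with open(path, "rb") as stream:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = stream.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


class ExistenceSketch:
    """Illustrative stand-in; the store here is just a dict (an assumption)."""

    def __init__(self, store=None, consider_bytes=0):
        self.store = {} if store is None else store
        self.consider_bytes = consider_bytes

    def add_or_replace_file(self, path, value):
        # Overwrites any value already stored under the same content hash.
        self.store[hash_file(path, self.consider_bytes)] = value

    def file_exists(self, path):
        # Returns the stored value for identical content, or None.
        return self.store.get(hash_file(path, self.consider_bytes))

    def try_add_file(self, path, value):
        # Adds the file only if its content is new; returns the existing
        # value when the content is already known, else None.
        key = hash_file(path, self.consider_bytes)
        existing = self.store.get(key)
        if existing is None:
            self.store[key] = value
        return existing
```

Because the key is derived from content alone, two files with different names and identical bytes map to the same entry, which is the behavior the description above relies on.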
 
- class bag.file_existence_manager.GdbmStorageStrategy(path='./file_hashes.gdbm', mode='c', sync='s')
- Bases: object
- Stores file hashes and file paths in a GNU DBM file.
- class bag.file_existence_manager.KeepLarger(dups_dir=None)
- Bases: object
- A callback that keeps the larger file. The smaller file is moved to a “dups” subdirectory.
- property dups_dir
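A rough sketch of the "keep the larger file" idea follows. It is a standalone function, not the KeepLarger class itself: the name keep_larger_sketch, the default location of the "dups" directory (next to the discarded file), and the tie-breaking rule (keep the original when sizes are equal) are assumptions.

```python
import shutil
from pathlib import Path


def keep_larger_sketch(original, dup, dups_dir=None):
    """Move the smaller of two files into a "dups" directory.

    Returns the path of the file that was kept. Hypothetical helper,
    written only to illustrate the behavior described above.
    """
    original, dup = Path(original), Path(dup)
    if dup.stat().st_size > original.stat().st_size:
        keep, discard = dup, original
    else:
        # Assumption: on a tie, the original stays in place.
        keep, discard = original, dup
    target_dir = Path(dups_dir) if dups_dir is not None else discard.parent / "dups"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(discard), str(target_dir / discard.name))
    return keep
```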
 
- class bag.file_existence_manager.TransientStrategy
- Bases: object
- Stores file hashes and paths in memory only.
- bag.file_existence_manager.check_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
- Check files in directory against the database at path.
- Example usage: check_dups(directory='some/directory', callbacks=[print_dup, trash_dup])
- bag.file_existence_manager.find_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
- Like check_dups(), but also updates the database as it goes.
- Given a directory, goes through all files that pass the predicate filter and, for each one that is a duplicate, calls each of the callbacks. Returns a dictionary containing the duplicates found.
- Example usage: d = find_dups(directory='some/directory', callbacks=[print_dup, KeepLarger()])
- The signature for writing callbacks is (original, dup, m), where original and dup are Path instances and m is the FileExistenceManager instance.
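As an illustration of the (original, dup, m) callback signature, here is a small sketch of a custom callback that only records the pairs it is given. The class name CollectDups is hypothetical. Writing it as a callable object mirrors how KeepLarger() is passed in the example above, though whether arbitrary callable objects are accepted is an assumption.

```python
from pathlib import Path


class CollectDups:
    """Callback that records each (original, dup) pair instead of acting on it."""

    def __init__(self):
        self.pairs = []

    def __call__(self, original, dup, m):
        # m is the manager instance passed by the caller; unused in this sketch.
        self.pairs.append((Path(original), Path(dup)))
```

An instance could then be listed alongside other callbacks, for example callbacks=[print_dup, collector], and inspected through collector.pairs afterwards.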
- bag.file_existence_manager.populate_db(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
- Create/update database at path by hashing files in directory.
- bag.file_existence_manager.print_dup(original, duplicate, m)
- A callback that just prints the duplicate pair. 
- bag.file_existence_manager.print_dup_unless_empty(original, duplicate, m)
- Print the duplicate pair unless the files are empty.