bag.file_existence_manager module
Tools for finding duplicate files.
- class bag.file_existence_manager.FileExistenceManager(store, consider_bytes=0)
Bases: object
Manages existing files through their hashcodes.
User code can add a file with add_or_replace_file(), check whether a file exists with file_exists(), or combine the 2 previous operations with try_add_file().
When checking for existence, only file content is considered; file names are irrelevant.
- add_or_replace_file(f, value)
Put the hash for file f in the store, associated with value, which is typically the path of f. Whether the hash already exists does not matter.
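The idea of keying files by content rather than by name can be illustrated with a small, self-contained sketch. It is not the library's implementation: the `TinyExistenceIndex` class, the `content_hash` helper, the choice of SHA-256, and the return value of `file_exists` are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path):
    """Return a SHA-256 hex digest of the file's bytes (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


class TinyExistenceIndex:
    """Illustrative stand-in, NOT the library's FileExistenceManager."""

    def __init__(self):
        self.store = {}  # maps content hash -> value (typically a path)

    def add_or_replace_file(self, path, value):
        # Overwrites any value previously stored under the same hash.
        self.store[content_hash(path)] = value

    def file_exists(self, path):
        # Assumption: return the stored value, or None when unknown.
        return self.store.get(content_hash(path))


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "report.txt"
    copy = Path(tmp) / "renamed-copy.txt"
    other = Path(tmp) / "other.txt"
    first.write_bytes(b"same bytes")
    copy.write_bytes(b"same bytes")
    other.write_bytes(b"different bytes")

    index = TinyExistenceIndex()
    index.add_or_replace_file(first, str(first))
    found_copy = index.file_exists(copy)    # same content, different name
    found_other = index.file_exists(other)  # different content
```

In this sketch the renamed copy is found because its bytes match, while the file with different content is not, regardless of file names.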
- class bag.file_existence_manager.GdbmStorageStrategy(path='./file_hashes.gdbm', mode='c', sync='s')
Bases: object
Stores file hashes and file paths in a GNU DBM file.
- class bag.file_existence_manager.KeepLarger(dups_dir=None)
Bases: object
A callback that keeps the larger file. The smaller file is moved to a “dups” subdirectory.
- property dups_dir
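The keep-the-larger-file behaviour described above might look roughly like the following sketch. It is not the library's code: the function name `keep_larger_sketch`, the tie-breaking rule (move `dup` when sizes are equal), and the default location of the “dups” directory (next to the smaller file) are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def keep_larger_sketch(original, dup, m=None, dups_dir=None):
    """Move the smaller of two files into a "dups" directory (illustrative)."""
    original, dup = Path(original), Path(dup)
    # Assumption: on a size tie, the file reported as the duplicate is moved.
    if dup.stat().st_size <= original.stat().st_size:
        smaller = dup
    else:
        smaller = original
    # Assumption: default to a "dups" folder beside the smaller file.
    target_dir = Path(dups_dir) if dups_dir else smaller.parent / "dups"
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / smaller.name
    shutil.move(str(smaller), str(destination))
    return destination


with tempfile.TemporaryDirectory() as tmp:
    big = Path(tmp) / "big.bin"
    small = Path(tmp) / "small.bin"
    big.write_bytes(b"x" * 100)
    small.write_bytes(b"x" * 10)

    moved_to = keep_larger_sketch(big, small)
    big_kept = big.exists()
    small_moved = moved_to.exists() and not small.exists()
    moved_parent = moved_to.parent.name
```

The function takes the same three leading arguments as the documented callbacks, so a variant of it could be adapted into a custom callback.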
- class bag.file_existence_manager.TransientStrategy
Bases: object
Stores file hashes and paths in memory only.
- bag.file_existence_manager.check_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Check files in directory against the database path.
Example usage:
check_dups(directory='some/directory', callbacks=[print_dup, trash_dup])
- bag.file_existence_manager.find_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Like check_dups(), but also updates the database as it goes.
Given a directory, goes through all files that pass the predicate filter and, for each one that is a duplicate, calls each of the callbacks. Returns a dictionary containing the duplicates found.
Example usage:
d = find_dups(directory='some/directory', callbacks=[print_dup, KeepLarger()])
The signature for writing callbacks is (original, dup, m), where original and dup are Path instances and m is the FileExistenceManager instance.
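As an illustration of the callback signature, here is a minimal sketch of a custom callback that records each duplicate pair instead of printing it. The class name `CollectDups` is hypothetical, and the manager argument is passed as `None` only to exercise the callback in isolation.

```python
from pathlib import Path


class CollectDups:
    """Hypothetical callback that remembers every duplicate pair it is given."""

    def __init__(self):
        self.pairs = []

    def __call__(self, original, dup, m):
        # m would be the FileExistenceManager instance; it is unused here.
        self.pairs.append((Path(original), Path(dup)))


collector = CollectDups()
# Call it by hand with the documented (original, dup, m) argument order.
collector(Path("photos/a.jpg"), Path("backup/a-copy.jpg"), None)
```

Because the object is callable with the documented three arguments, an instance like `collector` could presumably be included in the `callbacks` list alongside `print_dup`.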
- bag.file_existence_manager.populate_db(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Create/update database at path by hashing files in directory.
- bag.file_existence_manager.print_dup(original, duplicate, m)
A callback that just prints the duplicate pair.
- bag.file_existence_manager.print_dup_unless_empty(original, duplicate, m)
Print the duplicate pair unless the files are empty.