bag.file_existence_manager module
Tools for finding duplicate files.
- class bag.file_existence_manager.FileExistenceManager(store, consider_bytes=0)
Bases:
object
Manages existing files through their hashcodes.
User code can call add_or_replace_file(), check whether a file_exists(), or combine the 2 previous operations with try_add_file().
When checking for existence, only file content is considered; file names are irrelevant.
- add_or_replace_file(f, value)
Put the hash for file f in the store, associated with value, which is typically the path of f. Whether the hash already exists does not matter.
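To make the idea of managing files "through their hashcodes" concrete, here is a minimal, self-contained sketch. It is not the library's implementation: SketchExistenceManager, content_hash, the choice of SHA-256, the plain dict used as a store, and the return values of file_exists() and try_add_file() are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's content; the name is ignored."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class SketchExistenceManager:
    """Hypothetical stand-in, NOT the real FileExistenceManager."""

    def __init__(self, store=None):
        # Assumption: the store behaves like a mapping of hash -> value.
        self.store = {} if store is None else store

    def add_or_replace_file(self, path, value):
        # Overwrites any previous value stored for the same content.
        self.store[content_hash(path)] = value

    def file_exists(self, path):
        # Assumption: return the stored value, or None if the content is unknown.
        return self.store.get(content_hash(path))

    def try_add_file(self, path, value):
        # Assumption: return the existing value if known; otherwise add, return None.
        key = content_hash(path)
        existing = self.store.get(key)
        if existing is None:
            self.store[key] = value
        return existing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = Path(tmp, "first.txt")
        copy = Path(tmp, "renamed_copy.txt")
        first.write_bytes(b"same content")
        copy.write_bytes(b"same content")
        manager = SketchExistenceManager()
        manager.add_or_replace_file(first, str(first))
        # The copy has a different name but identical content.
        print(manager.file_exists(copy) == str(first))
```

Because the key is derived from content alone, a renamed copy is recognised as existing, which matches the statement that file names are irrelevant.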
- class bag.file_existence_manager.GdbmStorageStrategy(path='./file_hashes.gdbm', mode='c', sync='s')
Bases:
object
Stores file hashes and file paths in a GNU DBM file.
- class bag.file_existence_manager.KeepLarger(dups_dir=None)
Bases:
object
A callback that keeps the larger file and moves the smaller file to a “dups” subdirectory.
- property dups_dir
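The "keep the larger file" policy can be sketched as follows. This is an illustration only: keep_larger_sketch is a hypothetical helper, and both the tie-breaking rule (keep the original when sizes are equal) and the default location of the "dups" directory (next to the file being moved) are assumptions, not documented behaviour of KeepLarger.

```python
import shutil
from pathlib import Path


def keep_larger_sketch(original, dup, dups_dir=None):
    """Move the smaller of two files into a "dups" directory; return the kept path."""
    original, dup = Path(original), Path(dup)
    # Assumption: on a size tie, the original is the one that stays in place.
    if original.stat().st_size >= dup.stat().st_size:
        keep, move = original, dup
    else:
        keep, move = dup, original
    # Assumption: without an explicit dups_dir, use a "dups" folder beside the file.
    target_dir = Path(dups_dir) if dups_dir else move.parent / "dups"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(move), str(target_dir / move.name))
    return keep
```

A size comparison only matters when two files reported as duplicates can differ in size, for example if only part of each file is hashed; whether consider_bytes works that way is an assumption here.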
- class bag.file_existence_manager.TransientStrategy
Bases:
object
Stores file hashes and paths in memory only.
- bag.file_existence_manager.check_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Check files in directory against the database path.
Example usage:
check_dups(directory='some/directory', callbacks=[print_dup, trash_dup])
- bag.file_existence_manager.find_dups(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Like check_dups(), but also updates the database as it goes.
Given a directory, goes through all files that pass the predicate filter and, for each one that is a duplicate, calls each of the callbacks. Returns a dictionary containing the duplicates found.
Example usage:
d = find_dups(directory='some/directory', callbacks=[print_dup, KeepLarger()])
The signature for writing callbacks is (original, dup, m), where original and dup are Path instances and m is the FileExistenceManager instance.
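A custom callback following the (original, dup, m) signature could look like the sketch below. CollectDups is a hypothetical name; the only thing taken from the documentation is the three-argument signature. It records each pair instead of printing it.

```python
from pathlib import Path


class CollectDups:
    """Hypothetical callback that records each duplicate pair."""

    def __init__(self):
        self.pairs = []

    def __call__(self, original, dup, m):
        # original and dup are expected to be Path instances;
        # m is the manager instance and is not used by this callback.
        self.pairs.append((Path(original), Path(dup)))
```

Assuming find_dups() invokes its callbacks as documented, an instance could be passed as callbacks=[collector] and inspected afterwards through collector.pairs.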
- bag.file_existence_manager.populate_db(path='./file_hashes.gdbm', directory='.', callbacks=[<function print_dup>], filter=<function <lambda>>)
Create/update database at path by hashing files in directory.
- bag.file_existence_manager.print_dup(original, duplicate, m)
A callback that just prints the duplicate pair.
- bag.file_existence_manager.print_dup_unless_empty(original, duplicate, m)
Print the duplicate pair unless the files are empty.