Problem Description:
The problem is to quickly match directory entries to a given string using PHP. The string in question is usually part of a file name or even a regular expression. Some of the code presented here does not allow regular expression matching, so take this comparison with a grain of salt. For the comparison itself, I am using the strpos function. This simply means that we will be looking for substrings of a filename, which is usually good enough when looking for a file, auto-completing a name...etc Approaches:
- Standard scan
$dirp = opendir('.'); if($dirp){ $counter = 0; while(FALSE !== ($file = readdir($dirp))){ if(FALSE !== strpos($file, $query)){ $counter += 1; if($counter > $max_hits) break; } } closedir($dirp); }
This is the standard way of iterating over a directory.
It uses little memory.
This code is intuitive,
Cons: It is awfully close to the metal.
The fact that you need to close the directory handle is another inconvenience.
- The glob function:
$files = glob("*$query*", GLOB_NOSORT);
Pros: Very handy function that allows for compact code.
Cons: It does not scale, therefore avoid for anything but toying around, or if you're absolutely certain that your directory contains few files.
- SPL GlobIterator:
$iterator = new GlobIterator("*$query*", FilesystemIterator::KEY_AS_FILENAME); $counter = 0; foreach($iterator as $item){ $counter += 1; if($counter > $max_hits) break; }
Note: The GlobIterator class is a PHP 5.3.0 addition to the Standard PHP Library.
Pros: Scales well. Convenient and efficient.
The code is compact and readable.
Cons: Forces a OO approach
- SPL DirectoryIterator:
$iterator = new DirectoryIterator("."); $counter = 0; if($iterator){ foreach($iterator as $item){ if(FALSE !== strpos($item->getFilename(), $query)){ $counter += 1; if($counter > $max_hits) break; } } }
Note: This is the SPL scandir.
This class was introduced earlier than GlobIterator.
Pros: Scales well
Besides being much more convenient to use than a regular scandir, it also gives access to individual files information, provides good caching, and is easier to modify.
Cons: Slower than GlobIterator
- Extending SPL DirectoryIterator:
class DirectoryReader extends DirectoryIterator{ // constructor.. duh! function __construct($path, $query){ $this->query = $query; /*** pass the $path off to the parent class constructor ***/ parent::__construct($path); } /*** members are only valid if they match the query ***/ function valid(){ if(parent::valid()){ if(FALSE === strpos(parent::getFilename(), $this->query)){ parent::next(); return $this->valid(); } return TRUE; } return FALSE; } private $query = NULL; } // end class $iterator = new DirectoryReader(".", $query); $counter = 0; while($iterator->valid()){ $counter += 1; if($counter > $max_hits) break; $iterator->next(); }
Note: This is the OO way of customizing your iterator.
Pros: Scales well
Readable, makes use of iterators
Cons: Slow
- Flat Index File:
$fp = fopen('index.txt', 'r'); $counter = 0; if($fp){ while($line = fgets($fp)){ if(FALSE !== strpos($line, $query)){ $counter += 1; if($counter > $max_hits) break; } } fclose($fp); }
Note: This is definitely an optimization trick.
Internally, PHP uses the php_stream* functions throughout, and is eventually bound to issuing a multitude of system calls to navigating the file-system data structure, which is likely, in many instances, to be slower than reading through a single file.
Pros: Scales well.
Gets rid of a lot of overhead.
Cons: Requires mostly static data as the index needs to be rebuilt on data change.
- Forcing Cache on Flat index file:
$entries = file('index.txt', FILE_SKIP_EMPTY_LINES); $counter = 0; if($entries){ foreach($entries as $entry){ if(FALSE !== strpos($entry, $query)){ $counter += 1; if($counter > $max_hits) break; } } }
Pros: Gets rid of a lot of overhead.
Cons: Does not scale.
Slower than 6).
Trying to outsmart your compiler is usually not a good idea.
- Flat Index File and PREG:
$fp = fopen('index.txt', 'r'); $counter = 0; if($fp){ while($line = fgets($fp)){ if(0 !== preg_match("/$query/", $line)){ $counter += 1; if($counter > $max_hits) break; } } fclose($fp); }
Note: The difference here is in the matching function.
It gives more flexibility, while keeping a cache of compiled regular expressions.
- Search Tree:
Create look-up directory:
if(2 !== $argc){ fwrite(STDERR, "Usage: {$argv[0]} <source directory>\n"); exit(1); } function process_substr($str, $orig_str){ $base_dir = getcwd(); $chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY); $path = implode('/', $chars); @mkdir($path, 0777, TRUE); chdir($path); if(!file_exists('results.dat')) $results = array(); else $results = unserialize(file_get_contents('results.dat')); if(!isset($results[$orig_str])) $results[$orig_str] = TRUE; file_put_contents("results.dat", serialize($results)); chdir($base_dir); } function substrings($str, $callback, $min_len=3){ $str_len = strlen($str); for($len = $min_len; $len < $str_len-$min_len; $len += 1){ for($start = 0; $start <= $str_len-$len; $start += 1){ $callback(substr($str, $start, $len), $str); } } } $source_dir = $argv[1]; $dirp = opendir($source_dir); if($dirp){ while(FALSE !== ($entry = readdir($dirp))){ if(strlen($entry) >= 3) substrings($entry, 'process_substr'); } closedir($dirp); }else{ fwrite(STDERR, "Cannot open supplied directory\n"); exit(2); }
Look up a string:
$query = 'foo'; $chars = preg_split('//', $query, -1, PREG_SPLIT_NO_EMPTY); $path = implode('/', $chars); $counter = count(unserialize(file_get_contents("$path/results.dat")));
Notes: In this example, the Search Tree is stored straight to the file system.
There is nothing wrong with this but you have to be very careful since there are some gotchas.
Prefer a DBMS in real implementations.
Pros: Very fast.
Scales well.
Cons: Takes a lot of space.
Requires mostly static data.
Takes some effort to set up.
Some Timings:
Tests are run using PHP 6.0.0 built: Jan 11 2010 19:09:25, Linux and ReiserFS v3 partition
- For a directory containing 125833 entries
$max_hits = 1000000; $query = 'foo';
Opendir: 23 matches in 0.22206497192383 us
Glob: 23 matches in 0.091760158538818 us
GlobIterator: 23 matches in 0.091748952865601 us
DirectoryReader: 23 matches in 0.58006715774536 us
DirectoryIterator: 23 matches in 0.27896213531494 us
Flat File: 23 matches in 0.12962293624878 us
Cached Flat File: 23 matches in 0.13482689857483 us
Flat File and PREG: 23 matches in 0.30407810211182 us
$max_hits = 11; $query = 'foo';
Opendir: 11 matches in 0.076170921325684 us
Glob: 23 matches in 0.092523097991943 us
GlobIterator: 11 matches in 0.09209680557251 us
DirectoryReader: 11 matches in 0.19307088851929 us
DirectoryIterator: 11 matches in 0.092113971710205 us
Flat File: 11 matches in 0.025264978408813 us
Cached Flat File: 11 matches in 0.079818964004517 us
Flat File and PREG: 11 matches in 0.060965061187744 us
$max_hits = 11; $query = 'foo..............................';
Opendir: 0 matches in 0.17899489402771 us
Glob: 0 matches in 0.10579991340637 us
GlobIterator: 0 matches in 0.10554003715515 us
DirectoryReader: 0 matches in 0.58912897109985 us
DirectoryIterator: 0 matches in 0.25362181663513 us
Flat File: 0 matches in 0.12676692008972 us
Cached Flat File: 0 matches in 0.1414749622345 us
Flat File and PREG: 0 matches in 0.31661105155945 us
$max_hits = 501; $query = 'san';
Opendir: 501 matches in 0.18119215965271 us
Glob: 646 matches in 0.095612049102783 us
GlobIterator: 501 matches in 0.09652304649353 us
DirectoryReader: 501 matches in 0.40601682662964 us
DirectoryIterator: 501 matches in 0.21287703514099 us
Flat File: 501 matches in 0.083136081695557 us
Cached Flat File: 501 matches in 0.1099488735199 us
Flat File and PREG: 501 matches in 0.21364402770996 us
- For a directory containing 4869 entries
$max_hits = 11; $query = 'foo';
Opendir: 12 matches in 0.010089159011841 us
Glob: 14 matches in 0.0035510063171387 us
GlobIterator: 12 matches in 0.0045490264892578 us
DirectoryReader: 12 matches in 0.027838945388794 us
DirectoryIterator: 12 matches in 0.0090799331665039 us
Flat File: 12 matches in 0.0038931369781494 us
Cached Flat File: 12 matches in 0.0040509700775146 us
Flat File and PREG: 12 matches in 0.029109001159668 us
Search Tree: 14 matches in 0.00045108795166016 us
Notes:
Some of the approaches cannot be tweaked (e.g. glob).
Also, using a Search Tree is only possible when dealing with medium to small size data as the look-up tree will take a lot of space.
The Search Tree preformance is only influenced by 1)The size of your alphabet 2)The Depth of the tree, hence the number of hits and query length are irrelevant.
Conclusion:
If you have no idea what to pick, and if PHP 5 >= 5.3.0, use SPL GlobIterator(3).
The long answer is that (as it is so often the case) it depends on your requirements: number of hits, length of the input string. In order of preference,
- If your data set is small, mostly static AND efficiency is a issue, use a Search Tree(9)
- If your data is mostly static but too big to build a look-up tree, and the required number of hits is low, a Flat Index File(6) is likely to improve lookup time.
- In every other cases, and if available, use SPL GlobIterator(3)
- SPL DirectoryIterator(4) is fine but avoid extending it
- Standard Scan(1) is fine but use it as a last resort
References:
PHP Source Code
Scandir Function
Standard PHP Library (SPL)
SPL-Standard PHPLibrary
Introduction to Standard PHP Library (SPL)
Please don't hesitate to flame on any nonsense that I may have written
Thanks for an awesome comparison. Any ideas about scandir() option?
ReplyDeletescandir() would fall in the Standard Scan umbrella, as it relies on a call to php_stream_scandir(), itself implemented as a standard opendir loop.
ReplyDeleteWhich means that it should performs closely to (1).
I'll verify that and post the results.
An obvious omission on my part, thanks for pointing it out.
Thank you very much for comparing all iterators, you saved me a half day work.
ReplyDeleteand thanks again for bench-marking.