Wednesday, January 20, 2010

Reducing the Granularity of a Map

You've got a database of place coordinates, but find that two or more places are too close to each other.
For instance, you want to make sure that no two places are within d miles of each other.
The problem is to find the minimum set of nodes to remove in order to satisfy the restriction.

While it is easy to visualize a solution for a graph of small size, it can quickly get complicated.
Consider these 3 places.


How do you decide which node(s) to remove in order to maintain the restriction that all pairs of points are more than 3 miles apart?
There are three obvious possibilities:
  • Remove red point
  • Remove black and green points
  • Remove all 3 points

The good news is that this is a well known optimization problem: the Independent Set problem.
The bad news is that it is NP-hard.
A simpler problem that can be solved with a greedy algorithm is finding a maximal (rather than maximum) independent set.
Whenever two nodes A and B are such that distance(A, B) < k, remove the one with the most neighbors, since it conflicts with the most other points.
In this algorithm, the map is simply a graph where there is an edge between two nodes if they are sufficiently close to each other (if the two locations are neighbors).

loop map as place1:
  loop map as place2:
    if place1 != place2 and
       distance(place1, place2) < k:
      if neighbors(place1) > neighbors(place2):
        discard(place1)
      else:
        discard(place2)
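
A runnable Python sketch of the same greedy pass (the planar distance and the in-memory list of points are assumptions for illustration; a real map would use great-circle distances and a spatial index):

from math import hypot

def thin(places, k,
         dist=lambda a, b: hypot(a[0] - b[0], a[1] - b[1])):  # planar distance, an assumption for this sketch
  """ Greedy pass: drop points until no two remaining places are closer than k.
      Of each conflicting pair, the point with the most close neighbors goes. """
  kept = list(places)
  def degree(p):
    return sum(1 for q in kept if q is not p and dist(p, q) < k)
  while True:
    pairs = [(a, b) for i, a in enumerate(kept)
                    for b in kept[i + 1:] if dist(a, b) < k]
    if not pairs:
      return kept
    a, b = pairs[0]
    kept.remove(a if degree(a) >= degree(b) else b)

thin([(0, 0), (2, 0), (4, 0)], 3)   # -> [(0, 0), (4, 0)]; the middle point had two close neighbors

On the three-point example above, the red point sits in every conflicting pair and has the most neighbors, so it is the one discarded and the black and green points survive.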

Of course, any domain knowledge should be exploited.
For instance, if the map represents the biggest cities in country X, then the population size of each location should be taken into account before discarding a node.

Geographical Mapping and Google Maps

What you want to do: partition a geographical area into sub-areas, each of which contains elementary entities.
You want to set up a web service where users are allowed to add locations by clicking on a map, say using the Google Maps API.
The problem is that adding all the locations at the top level of a tree quickly leads to a bloated data structure, increasing search time and making it hard to scale.
One approach is to divide the area into partitions, a bit like the United States is divided into states.
States work fine since their union covers the whole country, but their shapes are unnecessarily tortuous.
All we really need is a set of possibly overlapping circular areas, each defined by a center and a radius r.



It turns out it is relatively easy to find databases of US populated locations with their coordinates.
Simply use these as your centers and pick a radius that will cover the whole territory, depending on the granularity of your database.
That way, users browse by locations and add markers for each location, allowing your data structure to scale.

A problem that this layout creates is that of marker visibility.
Assume two towns are rather close to each other and one user adds a place through town A; all users browsing the same area through town B will not see that marker.
A solution is to store all the neighbor locations of a given location.
Given a location L1, a location L2 is a neighbor of L1 iff distance(L1, L2) < 2r, where r is the radius of a location.
Using this scheme, town B is listed as a neighbor of town A, so town B's place markers are also displayed when browsing through any of its neighbors, including town A.
That way, no matter through which location a user adds a marker, all other users will be able to see it if they browse that area through neighbor locations (locations that would cover the marker).

What if I select a location on the west coast, then add a marker on the east coast?
This would break the system, as the marker would not be visible from any location other than the original one and its neighbors.
That's why, in addition to storing the neighbors of each location, users should not be allowed to add place markers more than r away from a location's center.
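
Here is a minimal Python sketch of that bookkeeping (the 2r neighbor rule and the within-r check are as described above; the haversine helper, the mileage constant and the dict layout are assumptions made for illustration):

from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8   # mean Earth radius, assumed value

def distance_miles(lat1, lon1, lat2, lon2):
  """ Great-circle (haversine) distance between two coordinates, in miles """
  lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
  a = sin((lat2 - lat1) / 2) ** 2 + \
      cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
  return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def neighbor_lists(centers, r):
  """ centers: dict of name -> (lat, lon); two locations are neighbors
      iff their centers are less than 2r miles apart """
  names = list(centers)
  result = dict((name, []) for name in names)
  for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
      if distance_miles(*(centers[n1] + centers[n2])) < 2 * r:
        result[n1].append(n2)
        result[n2].append(n1)
  return result

def can_add_marker(center, marker, r):
  """ Reject markers dropped more than r miles from the selected location's center """
  return distance_miles(center[0], center[1], marker[0], marker[1]) <= r

With r = 20 miles, two towns whose centers are 35 miles apart are listed as each other's neighbors, so a marker added through one is also displayed when browsing the other.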

If all these constraints are respected, you've got a scalable database of places, provided that users know the general location they are looking for. A reasonable assumption.

Desktop Tower Defense: My Computer Build

Remember that disturbingly addictive Flash game from two years ago?
There were two things to consider: how to arrange your turrets (your build), and which turrets to pick.
Can you have your computer produce the perfect layout?
Assuming that you assign a score to each area of the grid based on neighboring blocks and the longest distance from start to finish, finding the layout that maximizes that score is NP-complete.
What about simply maximizing the overall vertical and horizontal distance that creeps have to follow? Still NP-complete.
Let's forget about building a deterministic algorithm and fall back on Monte Carlo instead.
The strategy is to randomly place blocks and retain the layouts with the longest path from start to finish.
Here is the best layout found by my machine:

screenshot

If you look at it closely, you'll see a resemblance to what a human player would build.

Here's the Python code:
dtd_graph.py
import sys
import random
import heapq

class Graph:
  NODE_FREE = 0
  NODE_OCCUPIED = -1
  NODE_V_START = 10
  NODE_V_END = 11
  NODE_H_START = 20
  NODE_H_END = 21

  visited = []
  
  def __init__(self, width, height):
    self.width = width
    self.height = height
    self.nnodes = width * height + 4 # grid + start and end points
    self.nedges = 0
    self.nodes = []
    self.edges = []
    self.tower_pos = [] # list of nodes
    self.vstart = -1
    self.vend = -1
    self.hstart = -1
    self.hend = -1

    """ Add the nodes """ 
    for _ in range(self.nnodes):
      self.nodes.append(self.NODE_FREE)
      row = []
      for _ in range(self.nnodes):
        row.append(False)
        self.visited.append(False)
      self.edges.append(row)

    """ Set up the grid """
    active_nodes = self.width * self.height
    for n in range(active_nodes):
      if n % self.width != 0:
        self.add_edge(n, n - 1)
      if n % self.width != self.width - 1:
        self.add_edge(n, n + 1)
      if n - self.width >= 0:
        self.add_edge(n, n - self.width)
      if n + self.width < active_nodes:
        self.add_edge(n, n + self.width)

    """ 
        Define vertical and horizontal starting and ending nodes (4)
        Link start and end nodes to the entry and exit nodes (nodes #10-17, #529-536
        and {208;234;260;286;312;338}, {233;259;285;311;337;363})
    """

    self.vstart = self.nnodes - 1
    self.vend = self.nnodes - 2
    self.hstart = self.nnodes - 3
    self.hend = self.nnodes - 4
  
    self.nodes[self.vstart] = self.NODE_V_START
    self.nodes[self.vend] = self.NODE_V_END
    self.nodes[self.hstart] = self.NODE_H_START
    self.nodes[self.hend] = self.NODE_H_END

    self.add_edge(self.vstart, 10)
    self.add_edge(self.vstart, 11)
    self.add_edge(self.vstart, 12)
    self.add_edge(self.vstart, 13)
    self.add_edge(self.vstart, 14)
    self.add_edge(self.vstart, 15)
    self.add_edge(self.vstart, 16)
    self.add_edge(self.vstart, 17)

    self.add_edge(self.vend, 529)
    self.add_edge(self.vend, 530)
    self.add_edge(self.vend, 531)
    self.add_edge(self.vend, 532)
    self.add_edge(self.vend, 533)
    self.add_edge(self.vend, 534)
    self.add_edge(self.vend, 535)
    self.add_edge(self.vend, 536)

    self.add_edge(self.hstart, 208)
    self.add_edge(self.hstart, 234)
    self.add_edge(self.hstart, 260)
    self.add_edge(self.hstart, 286)
    self.add_edge(self.hstart, 312)
    self.add_edge(self.hstart, 338)

    self.add_edge(self.hend, 233)
    self.add_edge(self.hend, 259)
    self.add_edge(self.hend, 285)
    self.add_edge(self.hend, 311)
    self.add_edge(self.hend, 337)
    self.add_edge(self.hend, 363)

  def display_vars(self):
    print self.nnodes, ": ", self.nodes
    print self.nedges, ": ", self.edges

  def display(self):
    for row in range(self.height):
      for col in range(self.width):
        if self.NODE_FREE == self.nodes[row * self.width + col]:
          print "-",
        else:
          print "X",
      print
    print

  def display_edges(self):
    for i in range(self.nnodes):
      print i, ": ",
      for j in range(self.nnodes):
        if True == self.edges[i][j]:
          print j, "(", self.nodes[j], ")", ", ",
      print

  def reset(self):
    self.tower_pos = []
    for i in range(self.width * self.height):
      self.nodes[i] = self.NODE_FREE
    self.visited = []
    for _ in range(self.nnodes):
      self.visited.append(False)

  """ Initialize the graph with turrets on the nodes given in the list @tower_pos """
  def restore(self, tower_pos):
    self.reset()
    for i in tower_pos:
      self.place_tower_on_node(i)

  def add_node(self, val):
    self.nnodes += 1
    self.nodes.append(val)
    row = []
    for _ in range(self.nnodes):
      row.append(False)
      self.visited.append(False)
    self.edges.append(row)
    return self.nnodes-1

  def add_edge(self, src, dst):
    self.edges[src][dst] = True
    self.edges[dst][src] = True
    self.nedges += 1 

  def del_edge(self, src, dst):
    self.edges[src][dst] = False
    self.edges[dst][src] = False
    self.nedges -= 1

  def place_tower(self, row, col):
    """ Updates the graph to reflect the drawing of a tower """
    if row < 0 or col < 0 or row >= self.height-1 or col >= self.width-1:
      return False
    node_num = row * self.width + col
    if self.nodes[node_num] != self.NODE_FREE or self.nodes[node_num + 1] != self.NODE_FREE or self.nodes[node_num + self.width] != self.NODE_FREE or self.nodes[node_num + self.width + 1] != self.NODE_FREE:
      return False
    self.nodes[node_num] = self.NODE_OCCUPIED
    self.nodes[node_num + 1] = self.NODE_OCCUPIED
    self.nodes[node_num + self.width] = self.NODE_OCCUPIED
    self.nodes[node_num + self.width + 1] = self.NODE_OCCUPIED
    self.tower_pos.append(node_num)
    return True

  def place_tower_on_node(self, node):
    """ Updates the graph to reflect the drawing of a tower """
    row = node / self.width
    col = node % self.width
    if row < 0 or col < 0 or row >= self.height-1 or col >= self.width-1:
      return False
    if self.nodes[node] != self.NODE_FREE or self.nodes[node + 1] != self.NODE_FREE or self.nodes[node + self.width] != self.NODE_FREE or self.nodes[node + self.width + 1] != self.NODE_FREE:
      return False
    self.nodes[node] = self.NODE_OCCUPIED
    self.nodes[node + 1] = self.NODE_OCCUPIED
    self.nodes[node + self.width] = self.NODE_OCCUPIED
    self.nodes[node + self.width + 1] = self.NODE_OCCUPIED
    self.tower_pos.append(node)
    return True
 
  def dijkstra(self, start, end):
    q = [(0, start)]  # Heap of (cost, path_head)
    visited = set()
    push, pop, add = heapq.heappush, heapq.heappop, visited.add
    while True:
      if 0 == len(q):
        return False
      (cost, v1) = pop(q)
      if v1 not in visited and self.nodes[v1] != self.NODE_OCCUPIED:
        add(v1)
        if v1 == end:
          return cost
        for v2 in range(self.nnodes):
          if True == self.edges[v1][v2]:
            if v2 not in visited:
              push(q, (cost + 1, v2))

  def cc(self):
    """ Runs a DFS from every unvisited node (one sweep per connected component) """
    for count in range(self.nnodes):
      if self.visited[count] == False:
        yield self.dfs(count)

  def dfs(self, start, target=None):
    return self.__run_dfs(start, target)

  def __run_dfs(self, i, target):
    self.visited[i] = True
    if i == target: return True
    for neighbor in range(self.nnodes):
      if True == self.edges[i][neighbor] and False == self.visited[neighbor]:
        if self.__run_dfs(neighbor, target):
          return True
    return False

bestpath.py
import sys
import random
from dtd_graph import Graph

def scramble(g):
  best_path = 0
  nodes = []
  
  while True:
    g.reset()
    for _ in range(68):
      node = random.randint(0, g.width * g.height - 1)
      while False == g.place_tower_on_node(node):
        node = random.randint(0, g.width * g.height - 1)

    vpath = g.dijkstra(g.vstart, g.vend)
    hpath = g.dijkstra(g.hstart, g.hend)
    if False == vpath or False == hpath:
      continue
    if best_path < vpath + hpath:
      best_path = vpath + hpath
      g.display()
      print vpath, hpath

def run():
  g_width = 26
  g_height = 22
  g = Graph(g_width, g_height)
  #g.display_edges()
  scramble(g)

def main():
  if len(sys.argv) != 1:
    print "usage:  python", sys.argv[0], "< filename"
    return -1
  else:
    run()

if __name__ == '__main__':
  sys.exit(main())

Sunday, January 17, 2010

On-Disk C Graph Library

This is a small on-disk graph library, which I find very convenient whenever I am dealing with huge graphs.
A year or so ago, I could not find any disk-based graph library, which was the motivation for the present code.
It uses gdbm for persistent data storage, meaning that the graph size is only limited by your storage space; beware, though, that operations on huge graphs will take a long time.
This code should be compiled on a Linux variant or a compatible system.
Adjacency lists are kept sorted and the look-up uses a binary search.
I needed a graph that could scale and that I could run a DFS on, which is why DFS is the only algorithm available to walk the graph.
The bpresearch function is adapted from Minix.
Everything else is GPLv3.
I've used it on several projects successfully, but it is alpha code, so use it carefully if at all, and you should probably rewrite bpresearch if you want the whole code base under the GPLv3 license.

graph.h:
#define GRAPH_EDGES_FILE "graph_edges.dbf"
#define GRAPH_VISITED_FILE "graph_visited.dbf"
typedef struct graph{
  unsigned long int nnodes;
  unsigned long int nedges;
  GDBM_FILE edges;
  GDBM_FILE visited;
  size_t size; // size of one data element
  int (*compar)(const void* n1, const void* n2);
} graph_t;
/* Call this function if you want to create a new graph */
int graph_init (graph_t* g_ptr, const char* edges_file, size_t size, int (*compar)(const void* n1, const void* n2));
int graph_restore (graph_t* g_ptr, const char* edges_file, size_t size, int (*compar)(const void* n1, const void* n2));
void graph_close (graph_t* g_ptr);
int graph_node_exists (graph_t* g_ptr, void* n);
int graph_edge_exists (graph_t* g_ptr, void* n1, void* n2);
int graph_add_node (graph_t* g_ptr, void* n);
int graph_add_edge (graph_t* g_ptr, void* n1, void* n2);
int graph_del_node (graph_t* g_ptr, void* n);
int graph_del_edge (graph_t* g_ptr, void* n1, void* n2);
void* graph_first_node (graph_t* g_ptr);
void* graph_next_node (graph_t* g_ptr, void* n);
/* calls @process on every node of the search */
void graph_dfs (graph_t* g_ptr, const void* n, void (*process)(const void* n));
/* calls @process_first on the first node of every component and calls @process_all on every node. The first node is passed to both functions */
void graph_cc (graph_t* g_ptr, void (*process_first)(const void* n), void (*process_all)(const void* n));

graph.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <gdbm.h>
#include "graph.h"
int graph_init (graph_t* g_ptr, const char* edges_file, size_t size, int (*compar)(const void* n1, const void* n2)){
  g_ptr->nnodes = 0L;
  g_ptr->nedges = 0L;
  g_ptr->edges = gdbm_open ((char*)edges_file, 0/* use default */, GDBM_NEWDB|GDBM_SYNC, 0600, 0);
  g_ptr->visited = gdbm_open (GRAPH_VISITED_FILE, 0/* use default */, GDBM_NEWDB|GDBM_SYNC, 0600, 0);
  if (!g_ptr->edges || !g_ptr->visited){
    fprintf (stderr, "(EE) Failed creating graph file(s)\n");
    return (-1);
  }
  g_ptr->size = size;
  g_ptr->compar = compar;
  return 0;
}
int graph_restore (graph_t* g_ptr, const char* edges_file, size_t size, int (*compar)(const void* n1, const void* n2)){
  g_ptr->edges = gdbm_open ((char*)edges_file, 0/* use default */, GDBM_WRITER|GDBM_SYNC, 0600, 0);
  g_ptr->visited = gdbm_open (GRAPH_VISITED_FILE, 0/* use default */, GDBM_NEWDB|GDBM_SYNC, 0600, 0);
  g_ptr->size = size;
  if (!g_ptr->edges || !g_ptr->visited){
    fprintf (stderr, "(EE) Failed opening graph file(s)\n");
    return (-1);
  }
  unsigned long int nnodes = 0L;
  unsigned long int nedges = 0L;
  datum key = gdbm_firstkey (g_ptr->edges);
  datum data;
  while (key.dptr){
    data = gdbm_fetch (g_ptr->edges, key);
    datum nextkey = gdbm_nextkey (g_ptr->edges, key);
    nnodes += 1;
    nedges += data.dsize / g_ptr->size;
    free (key.dptr);
    free (data.dptr);
    key = nextkey;
  }
  g_ptr->nnodes = nnodes;
  g_ptr->nedges = nedges;
  g_ptr->size = size;
  g_ptr->compar = compar;
  return 0;
}
void graph_close (graph_t* g_ptr){
  gdbm_close (g_ptr->edges);
  gdbm_close (g_ptr->visited);
  unlink (GRAPH_VISITED_FILE);
}
int graph_node_exists (graph_t* g_ptr, void* n){
  datum key = {(char*)n, g_ptr->size};
  return gdbm_exists (g_ptr->edges, key);
}
int graph_edge_exists (graph_t* g_ptr, void* n1, void* n2){
  if (!graph_node_exists (g_ptr, n1) || !graph_node_exists (g_ptr, n2))
    return 0;
  int result;
  datum key = {(char*)n1, g_ptr->size};
  datum data = gdbm_fetch (g_ptr->edges, key);
  void* edges = (void*)data.dptr;
  size_t len = data.dsize / g_ptr->size;
  if (bsearch (n2, edges, len, g_ptr->size, g_ptr->compar))
    result = 1;
  else
    result = 0;
  if (data.dsize)
    free (edges);
  return result;
}
int graph_add_node (graph_t* g_ptr, void* n){
  datum key = {(char*)n, g_ptr->size};
  datum data = {"", 1};
  int result = gdbm_store (g_ptr->edges, key, data, GDBM_INSERT);
  if (0 == result)
    g_ptr->nnodes += 1;
  return result;
}
/*
 * Searches an array of nmemb objects, the initial
 * member of which is pointed to by base, for a member that would follow the
 * object pointed to by key.
 * Modified from Minix bsearch
 * (c) copyright 1987 by the Vrije Universiteit, Amsterdam, The Netherlands.
 * See the copyright notice in the ACK home directory, in the file "Copyright".
 * $Header: /opt/proj/minix/cvsroot/src/lib/ansi/bsearch.c,v 1.1.1.1 2005/04/21 14:56:04 beng Exp $
 */
static void* bpresearch (const void* key, const void* base, size_t nmemb, size_t size, int (*compar)(const void*, const void*)){
  register const void* mid_point = base;
  int moreflag = 0;
  while (nmemb > 0){
    mid_point = (char*)base + size * (nmemb >> 1);
    if (compar (key, mid_point) > 0){
      base  = (char*)mid_point + size;
      nmemb = (nmemb - 1) >> 1;
      moreflag = 1;
    }else{
      nmemb >>= 1;
      moreflag = 0;
    }
  }
  return (void*)((char*)mid_point + moreflag * size);
}
static void insert_sorted (void* dest, const void* key, const void* base, size_t nmemb, size_t size, int (*compar)(const void*, const void*)){
  void* follow = bpresearch (key, base, nmemb, size, compar);
  const char* bse = base;
  char* dst = dest;
  size_t first_half = (char*)follow - bse;
  size_t second_half = (bse + nmemb * size) - (char*)follow;
  memcpy (dst, bse, first_half);
  memcpy (dst + first_half, key, size);
  memcpy (dst + first_half + size, bse + first_half, second_half);
}
int graph_add_edge (graph_t* g_ptr, void* n1, void* n2){
  if (0 == memcmp (n1, n2, g_ptr->size)
  || !graph_node_exists (g_ptr, n1) 
  || !graph_node_exists (g_ptr, n2) 
  || graph_edge_exists (g_ptr, n1, n2))
    return -1;
  datum key = {(char*)n1, g_ptr->size};
  datum data = gdbm_fetch (g_ptr->edges, key);
  void* edges = (void*)data.dptr;
  size_t new_size = data.dsize + g_ptr->size;
  char new_edges[new_size];
  insert_sorted (new_edges, n2, edges, data.dsize/g_ptr->size, g_ptr->size, g_ptr->compar);
  datum new_data = {new_edges, new_size};
  if (data.dsize)
    free (edges);
  int result = gdbm_store (g_ptr->edges, key, new_data, GDBM_REPLACE);
  if (0 == result)
    g_ptr->nedges += 1;
  return result;
}
int graph_del_node (graph_t* g_ptr, void* n){
  if (!graph_node_exists (g_ptr, n))
    return -1;
  datum key = {(char*)n, g_ptr->size};
  int result = gdbm_delete (g_ptr->edges, key);
  if (0 == result)
    g_ptr->nnodes -= 1;
  return result;
}
static void remove_sorted (void* dest, const void* key, const void* base, size_t nmemb, size_t size, int (*compar)(const void*, const void*)){
  void* ptr = bsearch (key, base, nmemb, size, compar);
  if (!ptr) return;
  const char* bse = base;
  char* dst = dest;
  size_t first_half = (char*)ptr - bse;
  size_t second_half = (bse + nmemb * size) - ((char*)ptr + size);
  memcpy (dst, bse, first_half);
  memcpy (dst + first_half, bse + first_half + size, second_half);
}
int graph_del_edge (graph_t* g_ptr, void* n1, void* n2){
  if (!graph_edge_exists (g_ptr, n1, n2))
    return -1;
  datum key = {(char*)n1, g_ptr->size};
  datum data = gdbm_fetch (g_ptr->edges, key);
  void* edges = (void*)data.dptr;
  size_t new_size = data.dsize - g_ptr->size;
  /* keep the buffer in scope until gdbm_store is called below */
  char new_edges[new_size > 0 ? new_size : 1];
  datum new_data;
  if (0 == new_size){
    new_data.dptr = "";
    new_data.dsize = 1;
  }else{
    remove_sorted (new_edges, n2, edges, data.dsize/g_ptr->size, g_ptr->size, g_ptr->compar);
    new_data.dptr = new_edges;
    new_data.dsize = new_size;
  }
  if (data.dsize)
    free (edges);
  int result = gdbm_store (g_ptr->edges, key, new_data, GDBM_REPLACE);
  if (0 == result)
    g_ptr->nedges -= 1;
  return result;
}
void* graph_first_node (graph_t* g_ptr){
  datum key = gdbm_firstkey (g_ptr->edges);
  return (void*)key.dptr;
}
void* graph_next_node (graph_t* g_ptr, void* n){
  datum prevkey = {(char*)n, g_ptr->size};
  datum key = gdbm_nextkey (g_ptr->edges, prevkey);
  return (void*)key.dptr;
}
static int is_visited (graph_t* g_ptr, const void* n){
  datum key = {(char*)n, g_ptr->size};
  return gdbm_exists (g_ptr->visited, key);
}
static void mark_visited (graph_t* g_ptr, const void* n){
  datum key = {(char*)n, g_ptr->size};
  datum data = {"", 1};
  gdbm_store (g_ptr->visited, key, data, GDBM_REPLACE);
}
static void reset_visited (graph_t* g_ptr){
  gdbm_close (g_ptr->visited);
  g_ptr->visited = gdbm_open (GRAPH_VISITED_FILE, 0/* use default */, GDBM_NEWDB|GDBM_SYNC, 0600, 0);
}
static void dfs (graph_t* g_ptr, const void* n, void (*process)(const void* n)){
  mark_visited (g_ptr, n);
  process (n);
  datum key = {(char*)n, g_ptr->size};
  datum data = gdbm_fetch (g_ptr->edges, key);
  void* edges = (void*)data.dptr;
  size_t len = data.dsize / g_ptr->size;
  for (int i=0; i<len; i+=1){
    if (!is_visited (g_ptr, edges)){
      dfs (g_ptr, edges, process);
      edges = (char*)edges + g_ptr->size;
    }
  }
  if (data.dptr)
    free (data.dptr);
}
void graph_dfs (graph_t* g_ptr, const void* n, void (*process)(const void* n)){
  reset_visited (g_ptr);
  dfs (g_ptr, n, process);
}
/* Iterate the graph and find connected components.
The function @process_first is called on the first node of a connected component - this node can be any node of the component.
@process_all is called on every node of the component, including the first.
*/
void graph_cc (graph_t* g_ptr, void (*process_first)(const void* n), void (*process_all)(const void* n)){
  reset_visited (g_ptr);
  void* node = graph_first_node (g_ptr);
  while (node){
    void* nextnode = graph_next_node (g_ptr, node);
    if (!is_visited (g_ptr, node)){
      process_first (node);
      dfs (g_ptr, node, process_all);
    }
    free (node);
    node = nextnode;
  }
}

Monday, January 11, 2010

9 Ways to Iterate Over a Directory in PHP

Recently, I spent some time trying to optimize a straight file lookup in PHP. Like most things in PHP, you have a wide variety of choices for iterating over a directory. This post goes over their respective efficiencies and implementations. The results presented here will vary depending on your host environment.


Problem Description:

The problem is to quickly match directory entries against a given string using PHP. The string in question is usually part of a file name or even a regular expression. Some of the code presented here does not allow regular expression matching, so take this comparison with a grain of salt. For the comparison itself, I am using the strpos function. This simply means that we will be looking for file names containing the query as a substring, which is usually good enough when looking for a file, auto-completing a name, etc.


Approaches:
  1. Standard scan

    $dirp = opendir('.');
    if($dirp){
      $counter = 0;
      while(FALSE !== ($file = readdir($dirp))){
        if(FALSE !== strpos($file, $query)){
          $counter += 1;
          if($counter > $max_hits)
            break;
        }
      }
      closedir($dirp);
    } 
     
    Pros: Scales well
    This is the standard way of iterating over a directory.
    It uses little memory.
    This code is intuitive.

    Cons: It is awfully close to the metal.
    The fact that you need to close the directory handle is another inconvenience.

  2. The glob function:

    $files = glob("*$query*", GLOB_NOSORT);

    Pros: Very handy function that allows for compact code.

    Cons: It does not scale, so avoid it for anything but toying around, unless you're absolutely certain that your directory contains few files.

  3. SPL GlobIterator:

    $iterator = new GlobIterator("*$query*", FilesystemIterator::KEY_AS_FILENAME);
    $counter = 0;
    foreach($iterator as $item){
      $counter += 1;
      if($counter > $max_hits)
        break;
    }
    

    Note: The GlobIterator class is a PHP 5.3.0 addition to the Standard PHP Library.
     
    Pros: Scales well. Convenient and efficient.
    The code is compact and readable.

    Cons: Forces an OO approach

  4. SPL DirectoryIterator:

    $iterator = new DirectoryIterator(".");
    $counter = 0;
    if($iterator){
      foreach($iterator as $item){
        if(FALSE !== strpos($item->getFilename(), $query)){
          $counter += 1;
          if($counter > $max_hits)
            break;
        }
      }
    }
    

    Note: This is the SPL scandir.
    This class was introduced earlier than GlobIterator.

    Pros: Scales well
    Besides being much more convenient to use than a regular scandir, it also gives access to individual file information, provides good caching, and is easier to modify.

    Cons: Slower than GlobIterator

  5. Extending SPL DirectoryIterator:

    class DirectoryReader extends DirectoryIterator{
      // constructor.. duh!
      function __construct($path, $query){
        $this->query = $query;
        /*** pass the $path off to the parent class constructor ***/
        parent::__construct($path);
      }
      /*** members are only valid if they match the query ***/
      function valid(){
        if(parent::valid()){
          if(FALSE === strpos(parent::getFilename(), $this->query)){
            parent::next();
            return $this->valid();
          }
          return TRUE;
        }
        return FALSE;
      }
      private $query = NULL;
    } // end class
    $iterator = new DirectoryReader(".", $query);
    $counter = 0;
    while($iterator->valid()){
      $counter += 1;
      if($counter > $max_hits)
        break;
      $iterator->next();
    }
    Note: This is the OO way of customizing your iterator.

    Pros: Scales well
    Readable, makes use of iterators

    Cons: Slow

  6. Flat Index File:

    $fp = fopen('index.txt', 'r');
    $counter = 0;
    if($fp){
      while($line = fgets($fp)){
        if(FALSE !== strpos($line, $query)){
          $counter += 1;
          if($counter > $max_hits)
            break;
        }
      }
      fclose($fp);
    }
    
    Note: This is definitely an optimization trick.
    Internally, PHP uses the php_stream* functions throughout and is eventually bound to issue a multitude of system calls to navigate the file-system data structures, which is likely, in many instances, to be slower than reading through a single file.

    Pros: Scales well.
    Gets rid of a lot of overhead.

    Cons: Requires mostly static data as the index needs to be rebuilt on data change.

  7. Forcing Cache on Flat index file:

    $entries = file('index.txt', FILE_SKIP_EMPTY_LINES);
    $counter = 0;
    if($entries){
      foreach($entries as $entry){
        if(FALSE !== strpos($entry, $query)){
          $counter += 1;
          if($counter > $max_hits)
            break;
        }
      }
    } 
    

    Pros: Gets rid of a lot of overhead.

    Cons: Does not scale.
    Slower than 6).
    Trying to outsmart your compiler is usually not a good idea.

  8. Flat Index File and PREG:

    $fp = fopen('index.txt', 'r');
    $counter = 0;
    if($fp){
      while($line = fgets($fp)){
        if(0 !== preg_match("/$query/", $line)){
          $counter += 1;
          if($counter > $max_hits)
            break;
        }
      }
      fclose($fp);
    }
    

    Note: The difference here is in the matching function.
    It gives more flexibility, while keeping a cache of compiled regular expressions.

  9. Search Tree:
    Create look-up directory:

    if(2 !== $argc){
      fwrite(STDERR, "Usage: {$argv[0]} <source directory>\n");
      exit(1);
    } 
    function process_substr($str, $orig_str){
      $base_dir = getcwd();
      $chars = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);
      $path = implode('/', $chars);
      @mkdir($path, 0777, TRUE);
      chdir($path);
      if(!file_exists('results.dat'))
        $results = array();
      else
        $results = unserialize(file_get_contents('results.dat'));
      if(!isset($results[$orig_str]))
        $results[$orig_str] = TRUE;
      file_put_contents("results.dat", serialize($results));
      chdir($base_dir);
    }
    
    function substrings($str, $callback, $min_len=3){
      $str_len = strlen($str);
      for($len = $min_len; $len < $str_len-$min_len; $len += 1){
        for($start = 0; $start <= $str_len-$len; $start += 1){
          $callback(substr($str, $start, $len), $str);
        }
      }
    }
    $source_dir = $argv[1];
    $dirp = opendir($source_dir);
    if($dirp){
      while(FALSE !== ($entry = readdir($dirp))){
        if(strlen($entry) >= 3)
          substrings($entry, 'process_substr');
      }
      closedir($dirp);
    }else{
      fwrite(STDERR, "Cannot open supplied directory\n");
      exit(2);
    }

    Look up a string:
    $query = 'foo';
    $chars = preg_split('//', $query, -1, PREG_SPLIT_NO_EMPTY);
    $path = implode('/', $chars);
    $counter = count(unserialize(file_get_contents("$path/results.dat")));
    

    Notes: In this example, the Search Tree is stored directly on the file system.
    There is nothing wrong with this in principle, but be careful: it creates a very large number of small files and directories.
    Prefer a DBMS in real implementations.

    Pros: Very fast.
    Scales well.
     
    Cons: Takes a lot of space.
    Requires mostly static data.
    Takes some effort to set up.

Some Timings:

Tests were run using PHP 6.0.0 (built Jan 11 2010 19:09:25), on Linux with a ReiserFS v3 partition.

- For a directory containing  125833 entries

$max_hits = 1000000;
$query = 'foo';

Opendir: 23 matches in 0.22206497192383 s
Glob: 23 matches in 0.091760158538818 s
GlobIterator: 23 matches in 0.091748952865601 s
DirectoryReader: 23 matches in 0.58006715774536 s
DirectoryIterator: 23 matches in 0.27896213531494 s
Flat File: 23 matches in 0.12962293624878 s
Cached Flat File: 23 matches in 0.13482689857483 s
Flat File and PREG: 23 matches in 0.30407810211182 s

$max_hits = 11;
$query = 'foo';

Opendir: 11 matches in 0.076170921325684 s
Glob: 23 matches in 0.092523097991943 s
GlobIterator: 11 matches in 0.09209680557251 s
DirectoryReader: 11 matches in 0.19307088851929 s
DirectoryIterator: 11 matches in 0.092113971710205 s
Flat File: 11 matches in 0.025264978408813 s
Cached Flat File: 11 matches in 0.079818964004517 s
Flat File and PREG: 11 matches in 0.060965061187744 s

$max_hits = 11;
$query = 'foo..............................';

Opendir: 0 matches in 0.17899489402771 s
Glob: 0 matches in 0.10579991340637 s
GlobIterator: 0 matches in 0.10554003715515 s
DirectoryReader: 0 matches in 0.58912897109985 s
DirectoryIterator: 0 matches in 0.25362181663513 s
Flat File: 0 matches in 0.12676692008972 s
Cached Flat File: 0 matches in 0.1414749622345 s
Flat File and PREG: 0 matches in 0.31661105155945 s

$max_hits = 501;
$query = 'san';

Opendir: 501 matches in 0.18119215965271 s
Glob: 646 matches in 0.095612049102783 s
GlobIterator: 501 matches in 0.09652304649353 s
DirectoryReader: 501 matches in 0.40601682662964 s
DirectoryIterator: 501 matches in 0.21287703514099 s
Flat File: 501 matches in 0.083136081695557 s
Cached Flat File: 501 matches in 0.1099488735199 s
Flat File and PREG: 501 matches in 0.21364402770996 s

- For a directory containing 4869 entries

$max_hits = 11;
$query = 'foo';

Opendir: 12 matches in 0.010089159011841 s
Glob: 14 matches in 0.0035510063171387 s
GlobIterator: 12 matches in 0.0045490264892578 s
DirectoryReader: 12 matches in 0.027838945388794 s
DirectoryIterator: 12 matches in 0.0090799331665039 s
Flat File: 12 matches in 0.0038931369781494 s
Cached Flat File: 12 matches in 0.0040509700775146 s
Flat File and PREG: 12 matches in 0.029109001159668 s
Search Tree: 14 matches in 0.00045108795166016 s

Notes:
Some of the approaches cannot be tweaked (e.g. glob).
Also, using a Search Tree is only possible when dealing with small to medium size data, as the look-up tree takes a lot of space.
The Search Tree performance is only influenced by (1) the size of your alphabet and (2) the depth of the tree; hence the number of hits and the query length are irrelevant.

Conclusion:

If you have no idea what to pick, and PHP 5 >= 5.3.0 is available, use the SPL GlobIterator(3).
The long answer is that (as is so often the case) it depends on your requirements: the number of hits and the length of the input string. In order of preference,
  • If your data set is small, mostly static AND efficiency is an issue, use a Search Tree(9)
  • If your data is mostly static but too big to build a look-up tree, and the required number of hits is low, a Flat Index File(6) is likely to improve lookup time.
  • In every other case, and if available, use SPL GlobIterator(3)
  • SPL DirectoryIterator(4) is fine but avoid extending it
  • Standard Scan(1) is fine but use it as a last resort

References:
PHP Source Code
Scandir Function
Standard PHP Library (SPL)
SPL - Standard PHP Library
Introduction to Standard PHP Library (SPL)


Please don't hesitate to flame on any nonsense that I may have written