
Algorithms for loading history file

Posted Jun 1, 2005 0:27 UTC (Wed) by Blaisorblade (guest, #25465)
In reply to: Fish - The friendly interactive shell by liljencrantz
Parent article: Fish - The friendly interactive shell

> The history file takes about a quarter of a second to load 1000 entries, though. That has to improve.
About this: since this is something smart algorithms can help with (I'm not sure about load time, but at least for search time), I wanted to answer. Have you thought of using a trie (also called a prefix tree)? It's a tree (not binary) where each word is stored split into characters, one character per level. So you store "cat a" and "cat b" along the same tree path, except for the last node:

(representing the tree in horizontal)
<c> -> <a> -> <t> -> < > -> <a> | <b>

This helps a lot with searches done from the beginning of the line. The root would be an empty node, having all (used) letters as children.

Actually this wouldn't help for Ctrl-R searches... a solution off the top of my head (which isn't in textbooks, so it's not as well studied as the above one) is to have a parallel array that contains all letters, where each letter points to all occurrences of that letter in the main tree. So for "a" you find a list of all the "a"'s in the main tree. Admittedly, it would be better to have a parallel trie of all the "a"'s in the main tree, but that would be hard (either you store the whole trie again, or all the "a"'s share their descendants, which isn't nice).

If you want, we can discuss this further (even if I'm not a deep expert on this)...


Algorithms for loading history file

Posted Jun 1, 2005 11:11 UTC (Wed) by liljencrantz (guest, #28458)

The history search uses a linear search right now, since the history list is so small that searches take virtually no time at all. Making the search use a more clever algorithm would just increase code bloat without adding any noticeable benefit to the user.

The thing that _does_ take time is reading the history file from disk. I haven't looked into whether this is because of disk latency, because fish does multiple mallocs for every entry (meaning several thousand mallocs are performed when loading into memory), or because of a slow hash function in the table used for removing duplicates. The hash function fish uses is pretty slow, since it does quite a lot of work. It is designed in much the same way as the SHA family of cryptographic functions, but with fewer rounds and a smaller data set. The upside is that the distribution of the hash values is very good. The downside is that the hash function is a bit slow.

Algorithms for loading history file

Posted Jun 1, 2005 16:15 UTC (Wed) by Blaisorblade (guest, #25465)

Ok, probably the search algorithm is fine...
As for the hash function: SHA-1 has an almost-zero probability of collision and gives a 160-bit hash; to avoid collisions among around 5000 entries, that's really overkill. A good hash function for non-cryptographic purposes (assuming you then compare entries that have the same hash) is of the following kind:

int hash(char a[], int size, int prime) {
  char *p = a;
  int sum = 0;
  while (*p) {
    sum = sum * prime;
    sum = (sum + *p++) % size;
  }
  return sum;
}

It returns an integer in the range [0, size); prime must be a prime number, usually chosen at compile time, and prime = 131 is a good choice. It is easy to combine this with the code below to build a testing program (feed it size and prime on the first line, then a test file, and then pipe the output through sort | uniq -c and compute the average number of collisions per used hash code with awk).

#include <stdio.h>

char str[1000];

int main(void) {
  int size, prime;
  scanf("%d %d", &size, &prime);
  while (scanf("%s", str) != EOF) {
    printf("%d\n", hash(str, size, prime));
  }
  return 0;
}

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds