I would like some obscure information

I know that somewhere out there, someone has this information, but I can’t seem to find it.

Imagine you have ten empty shelves. You’re going to fill these shelves with books, and assume that you want to shelve the books alphabetically by author. Further assume that the books you want to shelve are more-or-less randomly distributed by author’s names. Now, the catch is that you want to determine what alphabetical range to assign to each shelf before you start putting the books on.

Presumably the information you need is the distribution of fiction books in English along an axis of author names. Then you could figure out how to divide this into ten intervals that would, on average, contain the same number of books from any pool of randomly selected books of English language fiction.

In computer geek terms, what I want is a hashing function that I can use to assign items to a particular list, so that sorting of the sublists will be easier, and so that subsequent merging of the sublists with a much larger master list would be optimal. In order for the hash function to be optimal, of course, I need to know a lot about the likely distribution of keys–and I don’t have that information.

8 Responses to “I would like some obscure information”

  1. Fred Says:
    1

    I’m not sure how you can possibly determine this beforehand, unless the books are of a uniform size and you can guarantee a letter won’t run on to the next shelf below. I mean, does anyone know the distribution of fiction books in English along an axis of author names? Because, y’know, there’s a lot of books out there. And some letters are statistically more likely for last names than others. Wouldn’t it simply be easier to determine how many books you have, and how those books break down by author name, and then how those books fit on these particular shelves?

    Maybe I’m misunderstanding what it is you’re doing. I’m not sure I get why you’d need to assign a range to each shelf — since that sort of thing is probably going to be self-evident after just a cursory glance at the books in question — much less before you’ve put the books on the shelves.

  2. Mr. McLaren Says:
    2

    Yeah, what I’m actually looking for is a rough distribution of how common author last names are in English. Which is probably just the same thing as a distribution of how common last names are in English.

    The real problem is this: I have around 500 linear feet of hardcover fiction in English, which is currently arranged in alphabetical order by author (with series and chronological order within each author).

    I also have that really big pile of new hardcover fiction books that I want to put away. Since all my current shelving is full, I’ve added two new full-height bookcases (10 shelves, around 30 linear feet) to the library to hold the new stuff. However, the new stuff has to put into the “right” place in alphabetical order.

    Putting them away involves finding the right alphabetical place for each of the new volumes. Since the existing shelves are full, to make a space for the new volume requires shifting all the books after the place where the new will go down a space. Shifting “down a space” can be a real pain in the ass when it means that something goes of the end of a shelf, and the whole next shelf then needs to be pushed down one, which means the next shelf… and so on for 500 feet across well over 50 shelves.

    So, the real trick is to first sort all the new books into the right order, so you can be sure that as you go along inserting them, you can insert them all in a single traversal of the shelves. (I.e. you never have to shift any books to the right more than once.)

    Since I have these two new bookcases that will end up holding Va-Wi across ten shelves, it makes sense to use them to hold the new books while I am sorting them, and doing the traversal.

    And, it would be much easier to sort the pile of new books if I could first split them into ten smaller groups without having to sort. I could (and probably will) just arbitrarily divide the 26 letters into 10 groups, guessing what will make approximately the same sized groupings, but I am enough of a geek to wonder if I could come up with an optimal set of ranges if I knew more about the distribution of author last names.

    Yes. I am insane. I understand this.

  3. Alex Wilson Says:
    3

    You need a page like this:
    http://www.sffworld.com/aname/a.html
    then total up the books column for each letter. That should give you the distributions for your hash function.

    There’s probably a better index out there somewhere, but this is the first good (at least, large enough) one I found.

    Are the 10 empty shelves all of you books? I thought you had more — could you not use your existing collection as a sample?

  4. Alex Wilson Says:
    4

    Duh… should have checked this first…

    Amazon query for author last names starting with ‘a’

    Change the last letter in the URL to cycle through the alphabet. “A” seems to be really skewed, but the counts for the other letters look like they are in line with each other.

  5. Mr. McLaren Says:
    5

    The 10 empty shelves are just to hold the new stuff from the last few months. I’m only trying to hash the new ones onto shelves temporarily to simplify the sorting of the physical objects until they can be merged into the master set.

    I could totally create a series of searches against amazon, each of the form “all book in ‘fiction’ with author last name starting with ‘term‘” and capture the result set size. Term would be one of the set {”aaa”,”aab”,”aac”,…,”aaz”, “aba”, “abb”, …, “zzx”, “zzy”, “zzz”}.

    That would give me 17576 results that I could use to plot relative frequency closely enough to guess.

  6. Mr. McLaren Says:
    6

    So, I decided to do an empirical trial. I decide to use 9 of the shelves, and keep one to use for overflow.

    My division was arbitrary, based on some intuitions about the distribution of last names, as follows:

    • 1) A, B
    • 2) C,D,E
    • 3) F,G,H,I,J
    • 4) K,L,M
    • 5) N,O,P
    • 6) Q,R
    • 7) S
    • 8) T
    • 9) U,V,W,X,Y,Z

    After sorting approximately half the new books, I have some early results.

    Any guess as to which shelf filled up first?

    It was actually roughly a tie between #1 and #4. The combination of As and Bs were roughly equal to the Ms, which seemed to be the clear winner.

    The emptiest shelf is #8, with only three slim T books.

    I hope the rest of the sample doesn’t follow this weighting…

  7. Fred Says:
    7

    I think it’s safe to say I have never brought this level of sophistication and careful study to the shelving of my own books.

  8. Mr. McLaren Says:
    8

    I’m an engineer, so I like to try to find optimal algorithms to solve problems.

    And I’m a collector so I’m crazy.

    It’s like chocolate and peanut butter.

Leave a Reply

You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Quicktags:

The odds are very good that comments submitted with JavaScript turned off will be flagged as spam.