Phil Dawes' Stuff: Indexes, Hashes & Compression

  • pigalle · 2 years ago
    re: optimal storage / read efficiency - have you tried reiser4? it does a wonderful job of not wasting disk space. a 'du -k' inside a dir used roughly the same amount of total space as an n3 serialization of the same data. the same ~30 MB of data took up 230 MB on ext3. and about one in every 5 triples is a blog post / news story with a 5K chunk of text - the difference would be even more absurd if not for that. also, read-back is much faster than your numbers would suggest - it's nowhere near 10 ms per call. what kind of drive are you using, a 423 MB thing you found in a discarded PC on the street?

    as for 'in memory' - the kernel disk cache is great for 'in memory' - especially in the concurrency department - 10 mongrels can all benefit from it w/o a separate memcached.

    as for indexing - i haven't thought about it much yet - my query engine takes about 0.1 seconds for a basic 'fetch the content, title, author, date, abstract of ___ resources sorted by ascending date'. hopefully that can be shaved down once i learn some stuff, and your previous post is my jumping-off point - thanks!

    oh ya. where's your source? mine's http://whats-your.name/yard
  • Seth Ladd · 2 years ago
    I've had good experience storing my data in columns. If I sort the data in each column, I'll get very good compression.
  • Phil Dawes · 2 years ago
    @pigalle: thanks for the comments - I'll take a look at reiser4.
    The ~10ms latency is for a disk seek, not a read. What sort of timings are you getting?

    @Seth: Cool - I'm planning on doing the same thing (have you read the research papers for C-Store?).
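
For anyone curious about Seth's point that a sorted column compresses well, here is a rough sketch rather than anything from the thread. It assumes a single column of small integer IDs (a hypothetical stand-in for something like predicate IDs), serialized one value per line and compressed with zlib; the data, the column size, and the `compressed_size` helper are all made up for illustration.

```python
import random
import zlib


def compressed_size(values):
    """Return the zlib-compressed size in bytes of one column,
    serialized as text with one value per line."""
    payload = "\n".join(str(v) for v in values).encode("utf-8")
    return len(zlib.compress(payload, 9))


# Hypothetical column: 10,000 IDs drawn from only 50 distinct values.
rng = random.Random(0)  # fixed seed so the run is repeatable
column = [rng.randrange(50) for _ in range(10_000)]

unsorted_size = compressed_size(column)
sorted_size = compressed_size(sorted(column))

print("unsorted:", unsorted_size, "bytes")
print("sorted:  ", sorted_size, "bytes")
```

Sorting turns the column into long runs of identical values, which is the easy case for a general-purpose compressor. Note that this sorts one column in isolation; a real column store has to keep the other columns in a matching order (or store row positions) so rows can be put back together.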