Did you ever see a char and think: “Damn, 1 byte for a single char is pretty darn inefficient”? No? Well, I did. So what I decided to do instead is take 5 chars, convert each char to a 2-digit integer, and then concatenate those five 2-digit ints into one big unsigned int. Boom: I stored 5 chars using only 4 bytes instead of 5. This works because an unsigned int holds a number up to ten digits long, so I can store one char using 2 digits. In theory you could save 32 different chars using this technique (the first two digits of the largest unsigned int are 42, and if you don't want to account for a possible 0 at the beginning you end up with 32 chars). If you decided to use all 10 digits you could save exactly 3 chars. Why should anyone do that? Idk. Is it way too much work to be useful? Yes. Was it funny? Yes.
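
A minimal sketch of the packing described above, assuming a 32-bit unsigned int and the 28-symbol alphabet from the code (a to z, space, dot, mapped to 10..37). This is not the pastebin code; `ALPHABET`, `char_to_code`, `code_to_char`, `pack5` and `unpack5` are made-up names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed alphabet: index i in this string gets the two-digit code 10 + i. */
static const char ALPHABET[] = "abcdefghijklmnopqrstuvwxyz .";

/* Returns the two-digit code for c, or 0 if c is not in the alphabet. */
static unsigned char_to_code(char c)
{
    /* strchr would also match the terminating '\0', so rule that out first. */
    const char *p = (c != '\0') ? strchr(ALPHABET, c) : NULL;
    return p ? (unsigned)(p - ALPHABET) + 10u : 0u;
}

/* Inverse of char_to_code; '?' marks a code outside 10..37. */
static char code_to_char(unsigned code)
{
    return (code >= 10u && code <= 37u) ? ALPHABET[code - 10u] : '?';
}

/* Packs exactly 5 chars, two decimal digits each, into one 32-bit value.
   Returns 0 on success, -1 if a char is outside the alphabet. */
static int pack5(const char in[5], uint32_t *out)
{
    uint32_t v = 0;
    for (size_t i = 0; i < 5; i++) {
        unsigned code = char_to_code(in[i]);
        if (code == 0u)
            return -1;
        v = v * 100u + code;   /* shift left by two decimal digits */
    }
    assert(v <= 3737373737u);  /* five codes <= 37 always fit in 32 bits */
    *out = v;
    return 0;
}

/* Unpacks a value produced by pack5 into a NUL-terminated 5-char string. */
static void unpack5(uint32_t v, char out[6])
{
    for (int i = 4; i >= 0; i--) {
        out[i] = code_to_char(v % 100u);  /* lowest two digits = last char */
        v /= 100u;
    }
    out[5] = '\0';
}
```

With this mapping, `pack5("hello", &v)` gives `v == 1714212124` (17, 14, 21, 21, 24 glued together), and `unpack5` turns that back into "hello".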

For anyone who's interested in the code, here's how I did it in C: https://pastebin.com/hDeHijX6

Yes I know, the code is probably bad, but I do not care. It was just a funny useless idea I had.

  • ChaoticNeutralCzech@feddit.org
    15 hours ago
    unsigned int turn_char_to_int(char pChar)
    {
        switch(pChar)
        {
        case 'a':
            return 10;
        case 'b':
            return 11;
        case 'c':
            return 12;
        case 'd':
            return 13;
        case 'e':
            return 14;
        case 'f':
            return 15;
        case 'g':
            return 16;
        case 'h':
            return 17;
        case 'i':
            return 18;
        case 'j':
            return 19;
        case 'k':
            return 20;
        case 'l':
            return 21;
        case 'm':
            return 22;
        case 'n':
            return 23;
        case 'o':
            return 24;
        case 'p':
            return 25;
        case 'q':
            return 26;
        case 'r':
            return 27;
        case 's':
            return 28;
        case 't':
            return 29;
        case 'u':
            return 30;
        case 'v':
            return 31;
        case 'w':
            return 32;
        case 'x':
            return 33;
        case 'y':
            return 34;
        case 'z':
            return 35;
        case ' ':
            return 36;
        case '.':
            return 37;
    
        }
    }
    

    Are you a monster or just stupid?

      • ChaoticNeutralCzech@feddit.org
        14 hours ago

        If you couldn’t write

        if(pChar >= 'a' && pChar <= 'z') return pChar - ('a' - 10);
        

        I suppose you typed this “all the size of a lookup table with none of the speed” abomination manually too.
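
        A hedged sketch of what the whole function could look like with that one-liner, assuming an ASCII-compatible character set where 'a' to 'z' are contiguous. The `return 0` for unknown characters is my own assumption; the switch version has no default case at all.

        ```c
        /* Same mapping as the switch: 'a'..'z' -> 10..35, ' ' -> 36, '.' -> 37. */
        unsigned int turn_char_to_int(char pChar)
        {
            if (pChar >= 'a' && pChar <= 'z')
                return (unsigned int)(pChar - ('a' - 10));
            if (pChar == ' ')
                return 36;
            if (pChar == '.')
                return 37;
            return 0; /* assumed "not in the alphabet" marker */
        }
        ```

        Space and dot still need their own branches because they do not sit next to the letters in ASCII.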

        • Zacryon@feddit.org
          13 hours ago

          Switch-case structures are very efficient in C and C++. They work similarly to an offset into memory: compute the offset once (any expression in the ‘case’ lines), then jump. Using primitives directly, like chars here, gives you the offset directly. This is in contrast to if-else branches, where each condition must be evaluated first and the CPU basically has no-op cycles in the pipeline until the result of the branch is known. If it fails, it proceeds with the next one, waits again, etc. (Some CPU architectures might have stuff like speculative branch execution, which can speed this up.)

          However, code-style wise this is really not elegant and something like your proposal or similar would be much better.

          • ChaoticNeutralCzech@feddit.org
            10 hours ago

            Oh, I didn’t know that they were a LUT of jump addresses. Still, a LUT of values would be more space-efficient and likely faster. Also, what if the values are big and sparse, e.g.

            switch (banknoteValue) {
                case 5000:
                    check_uv();
                    check_holograph();
                    /* fall through */
                case 2000:
                    check_stripe();
                    /* fall through */
                case 1000:
                    check_watermark();
            }
            

            …does the compiler make it into an if-else-like machine code instead?
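
            For the "LUT of values" idea, a minimal sketch, assuming the same 28-symbol alphabet as the original switch. The names `code_lut`, `init_code_lut` and `lookup_code` are made up, and 0 as the "not in the alphabet" value is my own choice.

            ```c
            #include <limits.h>
            #include <stdint.h>

            /* One byte per possible char value; static storage starts out as all zeros,
               so every char outside the alphabet maps to 0. */
            static uint8_t code_lut[UCHAR_MAX + 1];

            /* Fills in the 28 used entries once: 'a'..'z' -> 10..35, ' ' -> 36, '.' -> 37. */
            static void init_code_lut(void)
            {
                const char *alphabet = "abcdefghijklmnopqrstuvwxyz .";
                for (unsigned i = 0; alphabet[i] != '\0'; i++)
                    code_lut[(unsigned char)alphabet[i]] = (uint8_t)(10u + i);
            }

            /* A single indexed load instead of a jump. */
            static unsigned int lookup_code(char c)
            {
                return code_lut[(unsigned char)c];
            }
            ```

            The table costs 256 bytes on typical platforms, and `init_code_lut()` has to run once before the first lookup.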

  • Zacryon@feddit.org
    16 hours ago

    At first I thought, “How are they going to compress 256 values, i.e. 1-byte-sized data, by ‘rearranging into integers’?”

    Then I saw your code and realized you are discarding 228 of them, effectively reducing the available symbol set by about 89%.

    Speaking of efficiency: since chars are essentially unsigned integers of size 1 byte and ‘a’ to ‘z’ are the values 97 to 122 (decimal, inclusive), you can greatly simplify your turn_char_to_int method by just subtracting 87 from each symbol to get it into your desired value range, instead of using this cumbersome switch-case structure. Space (32) and dot (46) would still need special handling to fit your desired range, though.

    Bit-encoding your chosen 28 values directly would require 5 bits each.
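
    A rough sketch of that bit-level variant, assuming each symbol has already been mapped to a code in 0..31. `pack6x5` and `unpack6x5` are hypothetical names. At 5 bits per symbol, six symbols fit into 30 of the 32 bits, one more than with the two-decimal-digit scheme.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Packs six 5-bit codes (each 0..31) into the low 30 bits of a uint32_t.
       codes[0] ends up in the most significant position. */
    static uint32_t pack6x5(const uint8_t codes[6])
    {
        uint32_t v = 0;
        for (int i = 0; i < 6; i++) {
            assert(codes[i] < 32u);   /* must fit in 5 bits */
            v = (v << 5) | codes[i];
        }
        return v;
    }

    /* Inverse of pack6x5: peels off 5 bits at a time, last symbol first. */
    static void unpack6x5(uint32_t v, uint8_t codes[6])
    {
        for (int i = 5; i >= 0; i--) {
            codes[i] = (uint8_t)(v & 0x1Fu);
            v >>= 5;
        }
    }
    ```

    Shifts and masks replace the multiply and divide by 100, and the two leftover bits stay unused.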

  • python@lemmy.world
    16 hours ago

    Hey, this is awesome for saving space when writing things to NFC tags! Every bit still matters with those suckers

    • enumerator4829@sh.itjust.works
      1 day ago

      Lol, using RAM like last century. We have enough L3 cache for a full linux desktop in cache. Git gud and don’t miss it (/s).

      (As an aside, now I want to see a version of puppylinux running entirely in L3 cache)

      • BartyDeCanter@lemmy.sdf.org
        5 hours ago

        I decided to take a look and my current CPU has the same L1 as my high school computer had total RAM. And the L3 is the same as the total for the machine I built in college. It should be possible to run a great desktop environment entirely in L3.

    • DacoTaco@lemmy.world
      1 day ago

      Cache, man, it's a fun thing. 32 bytes (derp, 32 not 32k) is a common cache line size. Some compilers realise that your data might be hit often and align it to the start of a cache line to make access fast and easy. So yes, it might allocate more memory than it should need, but then it's to align the data to something like a cache line.
      There are also hardware reasons why that might be the case. I know the Wii's main processor communicates with the co-processor over memory locations that should be 32k aligned because of access speed, not only because of cache. Sometimes, more is less :')

      Hell, it might even be a matter of instruction speed: loading and handling 32k of data might be faster than a single byte :').

      Then there is also the minimum heap allocation size that might factor in. Though a 32k minimum memory block seems… Excessive xD
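
      To make the "more memory than it should need" part concrete, here is a small C11 sketch that forces the alignment by hand instead of relying on the compiler. The 32-byte line size is an assumption, and `CACHE_LINE`, `struct counter`, `hot_counter` and `is_line_aligned` are made-up names.

      ```c
      #include <assert.h>
      #include <stdalign.h>
      #include <stdint.h>

      #define CACHE_LINE 32  /* assumed cache line size in bytes */

      /* One byte of payload, forced onto its own cache line. */
      struct counter {
          alignas(CACHE_LINE) uint8_t value;
      };

      /* The padding is visible in the size: 1 byte of data occupies 32 bytes. */
      static_assert(sizeof(struct counter) == CACHE_LINE,
                    "struct is padded up to a full line");

      static struct counter hot_counter;

      /* Returns non-zero if p sits on a CACHE_LINE boundary. */
      static int is_line_aligned(const void *p)
      {
          return ((uintptr_t)p % CACHE_LINE) == 0;
      }
      ```

      Here `is_line_aligned(&hot_counter)` holds, at the cost of 31 padding bytes per counter.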

  • sunbeam60@lemmy.ml
    1 day ago

    After all… Why not?

    Why shouldn’t I ignore the 100+ cultures whose character set couldn’t fit into this encoding?

  • drath@lemmy.world
    1 day ago

    Oh god, please don’t. Just use utf8mb4 like a normal human being, and let the encoding issues finally die out (when Microsoft kills code pages). If space is a consideration, just use compression, like gz or something.

  • NigelFrobisher@aussie.zone
    1 day ago

    My colleague said he didn’t see the point in storing enums as shorts or bytes instead of a full word, so I retaliated by storing them in their string form instead, arguing that it’ll be compressed by the db engine.

  • traceur301@lemmy.blahaj.zone
    1 day ago

    I’m not sure if this is the right setting for technical discussion, but as a relative elder of computing I’d like to answer the question in the image earnestly. There’s a few factors squeezing the practicality out of this for almost all applications: processor architectures (like all of them these days) make operating on packed characters take more operations than 8 bit characters so there’s a speed tradeoff (especially considering cache and pipelining). Computers these days are built to handle extremely memory demanding video and 3d workloads and memory usage of text data is basically a blip in comparison. When it comes to actual storage and not in-memory representation, compression algorithms typically perform better than just packing each character into fewer bits. You’d need to be in a pretty specific niche for this technique to come in handy again, for better or for worse

    • gusgalarnyk@lemmy.world
      1 day ago

      I liked the technical discussion so thank you. Keep it up, I got into this career because there was always so much to learn.

    • da_cow (she/her)@feddit.org (OP)
      1 day ago

      This is 100% true. I never plan on actually using this. It might be useful for working on microcontrollers like an ESP32, but apart from that, trading extra computation for the memory savings is not worth it.

      • rustyricotta@lemmy.dbzer0.com
        23 hours ago

        Having seen many of Kaze’s videos on N64 development, I’ve learned that the N64 has like 4x the processing power it needs compared to its memory. In hardware cases like that, trading computational power for memory savings gets you some nice performance gains.

    • nik9000@programming.dev
      1 day ago

      There is still fun to be had! Just… Different fun!

      In database land, lookup tables are pretty common. Prefix tries and the like are super common in search land. I’ve seen GCD, offset, delta-of-delta, and some funky bitwise floating point compression used. Sometimes just to save disk space. But usually to save working set space or IO or S3 cache space.

      And squeezing the most out of modern CPUs is its own art. Compilers are glorious. And modern CPUs are magic lightning rocks. But you can learn to sing to them just right to make them all happy.