1. Transitions
We've got some sprites for a character in an RPG that kp is working on for Ludum Dare -- say, 27 of them for 27 states. Every state can transition to any other, and we have an animation for each transition between sprites.
We are bound by memory on this particular project, since the game should run on a MEGA65. To save space, we've designed the game so that the reverse transition is the same animation played backwards, so we only have to store each animation once. (The State A -> State B animation is the State B -> State A animation played in reverse.)
Is it possible to find a single sequence of states such that, by storing the animation between each consecutive pair, every transition is stored exactly once?
Example:
5 States: ABCDE
All transitions to store: AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
Example of a solution sequence: ABCEBDCAEDA
Think about the problem for a bit! Solution: https://i.fii.moe/1UYsl8R5tH46Uh.pdf
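As the linked solution explains, this is an Eulerian-circuit problem in disguise: states are vertices of a complete graph and transitions are edges. One way to find such a sequence programmatically is Hierholzer's algorithm; a minimal sketch (function and variable names are my own):

```python
from collections import defaultdict

def transition_sequence(states):
    # Build the complete graph: one undirected edge per unordered pair.
    adj = defaultdict(set)
    for i, u in enumerate(states):
        for v in states[i + 1:]:
            adj[u].add(v)
            adj[v].add(u)
    # Hierholzer's algorithm: walk until stuck, then unwind onto the circuit.
    stack, circuit = [states[0]], []
    while stack:
        u = stack[-1]
        if adj[u]:
            v = adj[u].pop()   # consume the edge u-v ...
            adj[v].remove(u)   # ... in both directions
            stack.append(v)
        else:
            circuit.append(stack.pop())
    return ''.join(circuit)

seq = transition_sequence('ABCDE')  # 11 states covering all 10 transitions
```

Such a circuit exists whenever every vertex has even degree, which holds for any odd number of states: with 27 states each vertex has degree 26, so one sequence covers all 351 transitions.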
2. Parse XDDDD
Let's look at why regex is not a particularly good choice for building parsers. Rob Pike has an argument from code style and practicality, but let's look at it from the point of view of expressive power, with a short introduction to language theory.
An alphabet Σ is a set of symbols that can be used to form strings -- binary strings have the alphabet Σ={0,1}, for instance. English can be more or less written with the ASCII character set, which we can also consider an alphabet.
Σ* denotes the set of all finite strings in the alphabet Σ. Every sequence of characters from Σ is in the set Σ* (in some sense this is the Library of Babel!). A language L over the alphabet Σ is any subset of Σ*. For example:
- L={1110101,101010101} is a finite language over Σ={0,1}.
- L={woomy, wooomy, woooomy, ...} is an infinite language over Σ={w,o,m,y}.
A recognizer R of the language L is a machine which, when given a string in L, is able to halt and tell you that the string is in fact in L. (R need not halt when given a string that is not in L!) Example: any parser is in fact a recognizer. A recognizer is a crippled parser that doesn't care to tell you the structure of the input!
Since we actually want our parser to terminate, let's only consider recognizers that are guaranteed to terminate (such machines are called deciders). In particular, let's look at the DFA (deterministic finite automaton) -- this is basically what people mean when they say "state machine".
Here is a state machine over the alphabet Σ={f,l,a,s,h,i}. The machine takes an input string and enters the starting state. To advance, it matches the next character of the input string against the arrows out of the current state and follows the matching one. If no transition is possible while the string is not yet completely consumed, the machine halts and rejects. If the string is completely consumed but we are not in a state marked with a double circle (in this case just G), the machine also rejects. Only if the string is completely consumed and we are in a double-circled state does the machine accept. Importantly, a DFA can only have finitely many states. Can you figure out the language which this machine accepts?
It accepts the language L={flashi, flashii, flashiii, ...}; the machine will reject "flash", for example.
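A DFA is little more than a lookup table. A minimal sketch of the machine above as a transition table (the state names A through F are assumptions; the text only names G):

```python
# Transition table: state -> {input character -> next state}.
TRANSITIONS = {
    'A': {'f': 'B'},
    'B': {'l': 'C'},
    'C': {'a': 'D'},
    'D': {'s': 'E'},
    'E': {'h': 'F'},
    'F': {'i': 'G'},
    'G': {'i': 'G'},  # self-loop: one or more trailing 'i's
}
ACCEPTING = {'G'}  # the double-circled state

def accepts(s):
    state = 'A'  # the starting state
    for ch in s:
        if ch not in TRANSITIONS[state]:
            return False  # no arrow for this character: reject
        state = TRANSITIONS[state][ch]
    return state in ACCEPTING  # input consumed; accept only in G
```

Here accepts('flashii') holds while accepts('flash') does not, since "flash" leaves us stranded in the non-accepting state F.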
But lach, why did you go through so many definitions and extra steps when I can just parse this with /^flashi+$/? Here is a result from formal language theory:
Theorem. A language L is accepted by some state machine if and only if L is described by a regular expression. In such cases, L is called a regular language.
That is to say, the expressive power of regular expressions is equivalent to that of state machines. This is in fact how efficient regex engines work: they translate your expression into an equivalent state machine.
Well, I'm not saying that regex is bad and you shouldn't ever use it, but here is the problem: regular languages are a very limited class of languages.
Theorem. (Pumping lemma) If L is a regular language, then there exists some p ≥ 1 such that every string w in L of length at least p can be split into three parts as w=a + b + c (concatenation of strings a, b, c), where:
- b is not the empty string
- (a + b).length ≤ p
- a + b + b + ... + b + c is in the language L. That is, we can repeat ("pump") the "b" part of w=a+b+c however many times we wish, and it will remain in the language.
For the regular language L={woomy, wooomy, ...}, we can set a='wo', b='o', c='my'. Then we can generate the entire language by pumping b='o'. Imagine what a state machine looks like for this language:
Notice how "pumping b='o'" is equivalent to going in circles on the self-transition on the state D. In other words, every regular language eventually turns into something like this. This is the main idea of the pumping lemma: since any DFA only has finitely many states, any sufficiently long string in the accepted language must visit some state twice, at which point we can just keep going in circles (the "pumping" action).
Let's look at a language that is not regular, a very simple and common one at that: let's try matching brackets. Our language is L={(), (()), ((())), ...} over the alphabet Σ={(,)}. It's easy to see that there is nothing we can pump here:
- If you try to pump some number of '(' or ')', then the numbers of opening and closing brackets no longer match -- the result is not in L.
- If you try to pump some number of '(' followed by some number of ')' -- as in the middle of any string in L -- then we are not nesting brackets anymore: a bracket is closed before all brackets have been opened, which is also not in L.
So L is not regular: NO REGEX will be able to parse it! This probably confirms your intuition about regex, that it is very "straight" or "flat". Here you see formally that the parse tree can only expand horizontally for repetitions, but never vertically to allow "nesting" in the traditional sense. (Side note: this simple "nesting" language is an example of a context-free language, and context-free grammars are still a lot weaker than general recursive grammars, but nesting is a good illustration of a simple property that a real-world grammar of interest will likely possess.)
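All it takes to go beyond a DFA here is a single unbounded counter, which finitely many states cannot simulate. A minimal recognizer sketch (the function name is my own):

```python
def accepts_nested(s):
    # L = { '('*n + ')'*n : n >= 1 }: n opening brackets, then n closing.
    # A DFA cannot track n for arbitrary n; one integer comparison can.
    half, rem = divmod(len(s), 2)
    if rem or half == 0:
        return False  # odd length or empty: cannot be in L
    return s == '(' * half + ')' * half
```

For instance, accepts_nested('((()))') holds, while '()()' is rejected: L contains only pure nesting.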
Since regex alone is not powerful enough, if you would like to use regex to build a parser, it must be interleaved with meticulously written control flow that pairs with what each regex describes. This is very brittle and error-prone: extremely high coupling directly over an extremely barebones API boundary for defining grammars (your programming language's control flow). This leads back to Rob Pike's arguments -- to the programmer, writing the parsing control flow and writing the regexes are essentially two separate activities.