Example #4: Learning from text (Markov chain)

In example 2 we wrote the rules by hand. Here the model learns on its own from text: it counts what most often follows what, and generates new text according to those probabilities. This is the principle of a "small language model", and the context slider shows how more memory means more sense.

What & how?

What it is: A Markov chain is a simple model that learns on its own, from any text, which word most often follows another, and then builds new text from those probabilities.

What to try: Paste your own text (or keep the sample) and generate. Use the context slider to change how many previous words the model remembers, and watch gibberish turn into almost-meaningful language.

Why it matters: In example 2 we wrote the rules by hand; here the model learns them from data. This is the principle of a "small language model": more context (memory) gives more meaning, much as in today's large models.

How it works and where the limits are

A Markov chain is one of the simplest language models. It doesn't understand meaning; it just computes statistics from the text: "this is what most often follows this combination". It then generates by randomly drawing the next character or word according to those frequencies. The longer the training text, the richer and more believable the result.
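The "random draw according to frequencies" step can be sketched in a few lines of Python. This is only an illustration, not the code behind this page: the function name `draw_next` and the counts for "th" are made up for the example.

```python
import random
from collections import Counter

def draw_next(followers: Counter, rng: random.Random) -> str:
    """Draw one follower at random, weighted by how often it was seen."""
    units = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(units, weights=weights, k=1)[0]

# Hypothetical counts: what followed "th" in some training text.
followers_of_th = Counter({"e": 6, "a": 3, "i": 1})

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
sample = [draw_next(followers_of_th, rng) for _ in range(1000)]

# Frequent followers should dominate the sample, rare ones still show up.
print(Counter(sample).most_common())
```

Because the draw is weighted rather than always picking the top follower, the same table can produce a different text on every run.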

How the model "learns"

It goes through the text and, for each piece, records what followed it. That builds a table of probabilities: for example, in English text "th" is typically followed most often by "e", "a" or "i".

No rules are written by hand (unlike example 2); they are read from the data.
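A minimal sketch of this counting step, working on characters. The function name `learn` and the tiny training sentence are assumptions chosen for illustration; the page's own implementation may differ.

```python
from collections import Counter, defaultdict

def learn(text: str, order: int = 2) -> dict[str, Counter]:
    """Map every `order`-character context to a count of what followed it."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        context = text[i:i + order]   # the piece we are looking at
        following = text[i + order]   # the character right after it
        table[context][following] += 1
    return dict(table)

table = learn("the theme then thaws", order=2)
print(table["th"])  # Counter({'e': 3, 'a': 1})
```

Dividing each count by the total for its context turns the counts into probabilities: here "e" follows "th" with probability 3/4 and "a" with probability 1/4.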

Order n = how much it remembers

The order sets how many previous units the model takes into account. That's the whole magic of the slider:

– low order (1–2): little context → gibberish, but "creative"

+ higher order (3–5): more context → almost-real language

– too high: it just copies sentences from the source
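The effect of the order can be tried out with a small character-level sketch. The names `learn` and `generate`, the sample sentence, and the chosen orders are all assumptions made for this illustration. With order 15, every context in this particular sentence is unique, so the model has no choice and reproduces the source.

```python
import random
from collections import Counter, defaultdict

def learn(text, order):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for i in range(len(text) - order):
        table[text[i:i + order]][text[i + order]] += 1
    return table

def generate(text, order, length, seed=0):
    """Generate up to `length` characters from a model trained on `text`."""
    rng = random.Random(seed)
    table = learn(text, order)
    out = text[:order]  # start from the opening of the source
    while len(out) < length:
        followers = table.get(out[-order:])
        if not followers:  # dead end: this context was never followed by anything
            break
        units, weights = zip(*followers.items())
        out += rng.choices(units, weights=weights, k=1)[0]
    return out

source = "the cat sat on the mat and the rat sat on the hat"
for n in (1, 3, 15):
    print(n, repr(generate(source, n, 40)))
```

Low orders wander freely between contexts, while the highest order here can only replay the training text.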

By characters vs. by words

Characters: the model also learns spelling and word shapes, so it can invent new words that look real but may not exist.

Words: the model assembles only existing words and sentences flow better, but it needs a longer text, otherwise it just repeats the source.
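The two modes differ only in how the text is cut into units before counting. A sketch under that assumption, with a hypothetical helper `learn_units` that accepts any list of units:

```python
from collections import Counter, defaultdict

def learn_units(units, order):
    """Count what follows each context of `order` units (characters or words)."""
    table = defaultdict(Counter)
    for i in range(len(units) - order):
        context = tuple(units[i:i + order])
        table[context][units[i + order]] += 1
    return table

text = "the cat sat on the mat"
char_table = learn_units(list(text), order=1)     # units are single characters
word_table = learn_units(text.split(), order=1)   # units are whole words

print(char_table[("t",)])    # followers of the character "t"
print(word_table[("the",)])  # followers of the word "the"
```

The same short text yields far fewer observations per context at the word level, which is one way to see why word mode needs a longer training text.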

Relation to today's LLMs

The principle is the same: predict the next piece of text from the previous context. But large models (such as ChatGPT) do it with a neural network, with a context of thousands of words, and are trained on billions of sentences.

That's why they "understand" much more, but you can see the core idea right here.

The point: even this "intelligence", only a few lines long, can mimic language surprisingly well while not understanding at all what the text is about. It is pure statistics over data. Large language models rest on the same principle of predicting the next piece of text, just at an incomparably larger scale.
