Saturday 30 May 2009

This is starting to push the bounds of usefulness, I realise, but I applied the same analysis of the most readable verses to chapters.

By the time you get to reading chapters, you need a lot of words, and it is unlikely you are going to worry about having to look up a few in the lexicon. So I realise this is artificial, but still, here are the results. The most readable chapters in the NT are:

* John 17 (the 'High Priestly Prayer' of Jesus before his arrest).

* John 16

* Rev 15 (not surprising - a very short chapter about the seven plagues).

* Rom 14

* 1 John 1 (the start of a book - many of the simplest chapters start books, as much of the language is formulaic - 1 John 1 is also very short).

Now, the thing that is most surprising about this list is just how far John 17 is ahead of the others. It is not a particularly short chapter, but it is highly repetitious, and scores only 1000. John 16, coming in second, scores 2216, and then there is another gap before we get lots of chapters around the 3000 mark.

Reading through the greek of John 17, I found it remarkably simple.

I think there's probably very little information in the result (no speculating that it was written by a different hand or anything like that), but I do find it interesting. And it is useful to now have a list of chapters that are pretty easy to read and where (importantly) any words I don't already know will be as useful as possible for my future reading.
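For anyone wanting to try this, here is a minimal sketch of how a chapter score could be computed. It assumes the chapter score works like the verse score (the frequency rank of the rarest word in the chapter). The function name `chapter_scores`, the `(book, chapter, lemma)` input shape and the alphabetical tie-breaking are all my own assumptions for illustration, and the data is a toy example rather than the real text.

```python
from collections import Counter, defaultdict

def chapter_scores(words):
    """Score each chapter by the frequency rank of its rarest lemma.

    words: iterable of (book, chapter, lemma) tuples (hypothetical input shape).
    """
    words = list(words)
    counts = Counter(lemma for _, _, lemma in words)
    # Most frequent lemma gets rank 1; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: i for i, w in enumerate(ordered, start=1)}

    scores = defaultdict(int)
    for book, chapter, lemma in words:
        key = (book, chapter)
        scores[key] = max(scores[key], rank[lemma])
    return dict(scores)

# Toy data: single letters stand in for lemmas.
toy_words = [
    ("John", 17, "a"), ("John", 17, "b"), ("John", 17, "a"),
    ("John", 17, "b"), ("Rev", 15, "a"), ("Rev", 15, "c"),
]
toy_result = chapter_scores(toy_words)
for chapter, score in sorted(toy_result.items(), key=lambda kv: kv[1]):
    print(chapter, score)
```

A lower score means an easier chapter, so sorting by score gives a list like the one below.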

FYI, the top 20 with scores are:



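| Position | Difficulty Score | Chapter |
|---|---|---|
| 1 | 1000 | John 17 |
| 2 | 2216 | John 16 |
| 3 | 2753 | Rev 15 |
| 4 | 2893 | Rom 14 |
| 5 | 3033 | 1 John 1 |
| 6 | 3033 | 1 John 3 |
| 7 | 3096 | John 1 |
| 8 | 3191 | Rev 11 |
| 9 | 3211 | Rev 17 |
| 10 | 3314 | Matt 20 |
| 11 | 3417 | 1 John 2 |
| 12 | 3417 | 1 John 4 |
| 13 | 3424 | 2 Cor 5 |
| 14 | 3431 | 2 Thes 2 |
| 15 | 3478 | Rom 5 |
| 16 | 3563 | Matt 10 |
| 17 | 3580 | Matt 7 |
| 18 | 3696 | Rom 6 |
| 19 | 3781 | Matt 28 |
| 20 | 3966 | Rev 1 |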

Thursday 28 May 2009

So I started playing with grammatical analyses, as I suggested in previous posts, and I had a go at a first visualization.




It is a simple heat-map of a verb table showing which conjugations of the verb are most common and which are rare. Interestingly, a whole slew of conjugations that appear in several of the grammars and verb tables I own never appear at all in the NT. The optative future is the main offender, but there are several others.

If you are primarily interested in reading the GNT, this map shows what conjugations to focus on, but there are limitations, as described on the chart.
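The counting behind a heat-map like this can be sketched quite simply. The following is an illustration only: it assumes each verb occurrence has already been reduced to a `(tense, mood)` pair, which is a simplification of a full parse, and the function name `conjugation_table` and the toy data are hypothetical.

```python
from collections import Counter

def conjugation_table(verbs, tenses, moods):
    """Count verb occurrences for every tense/mood cell, including empty cells.

    verbs: iterable of (tense, mood) pairs, one per verb occurrence.
    Returns one row per tense, with one count per mood.
    """
    counts = Counter(verbs)
    # A Counter returns 0 for combinations that never occur in the text.
    return [[counts[(t, m)] for m in moods] for t in tenses]

# Toy data, not real NT counts.
toy_verbs = [
    ("present", "indicative"), ("aorist", "indicative"),
    ("present", "indicative"), ("aorist", "subjunctive"),
]
tenses = ["present", "aorist", "future"]
moods = ["indicative", "subjunctive", "optative"]

table = conjugation_table(toy_verbs, tenses, moods)
for tense, row in zip(tenses, table):
    print(tense, row)
```

The rows of zeros are the interesting part: they are the forms that a grammar lists but the text never uses.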

You can download it in either International A4 or US Letter format. Both are pdf files and both are around 10K in size, so they're tiny.

Wednesday 27 May 2009

In my last post I talked about the number of headwords you'd need to know to start reading the NT, if you learned the words in decreasing order of frequency. (You'd need 33 to read Matt 16:15, as it turns out).

We can continue the same process for all the verses in the bible. For each verse we assign a score based on where in the frequency word list its rarest word occurs. If we order all verses in the bible using this technique we end up with this graph.




Initially in the bottom left you need to learn quite a few words before you can read another verse of the bible (it takes 33 to get to verse one, another 2 words for the next verse, then 15 more before you can read number three). Over time though the payoff begins to show. In the middle of the graph, each new word you learn gives you the ability to read between two and three more verses. Then towards the end we end up in the territory of words that only appear once, and so learning them only helps us read one extra verse.
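The curve itself is just a cumulative count: for a vocabulary of the n most frequent words, count the verses whose score is n or less. Here is a minimal sketch, assuming the verse scores have already been computed. The name `readable_curve` is hypothetical and the scores are toy values (33 and 35 are the two lowest scores from the earlier post, the rest are made up).

```python
from bisect import bisect_right

def readable_curve(scores, vocab_sizes):
    """For each vocabulary size n, count verses whose score is <= n."""
    ordered = sorted(scores)
    # bisect_right gives the number of sorted scores that are <= n.
    return [bisect_right(ordered, n) for n in vocab_sizes]

# Toy verse scores, not the real data set.
toy_scores = [33, 35, 50, 50, 61, 120]
vocab_sizes = [10, 33, 50, 100, 200]

curve = readable_curve(toy_scores, vocab_sizes)
for n, verses in zip(vocab_sizes, curve):
    print(f"{n} words -> {verses} verses readable")
```

Plotting vocabulary size against the resulting counts gives a graph of the shape described above.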

You can think of this as three zones of learning:

1. Early undergraduate study. Everything is tough - everything seems like a special case, and nothing joins up.

2. Late undergraduate study. Suddenly you find yourself reading bits of the NT with minimal help from a lexicon. It is a great feeling.

3. The point beyond where I am. You can read about 2/3 of the NT without reference, but the words you don't know really are a special case here.

I expected the graph to have roughly this shape (flat initially, then steepening before flattening again), but I thought that the steep slope would be even steeper. It turns out that even at best (with a vocab around 1000 words) you get less than three new verses for every word you learn. On the positive side, however, the initial flat bit where progress is slow is far smaller than I expected. At this scale it is really only the smallest of ticks. After a couple of hundred words, you are into the most productive zone.

This is deeply encouraging for students of the language. Just a couple of hundred words and you can do a lot.

Of course, as I've said several times, this applies to vocab only. At some point I'll do a similar analysis of grammatical features.

PS: In case you're wondering, the little kinks at 2655 and 3495 words are where the words start to appear twice in the NT and once in the NT respectively.

There are a few different types of math that we could do with the NT. If things go well, I might even get round to doing them.

* Firstly I'm going to be doing mostly counting stuff: frequencies, broad patterns, distributions, and so on. These are easy, because they only involve single quantities. We can look at the distribution of a particular verb form across the NT books, for example, or find the easiest verse to read (as in the previous post).

* Secondly we could look at correlations. These are data that are derived from more than one quantity, and specifically look at how they vary in combination with one another. Strictly the distribution of verbs across NT books is a correlation measure, but here I'm particularly thinking about text-text correlations. We could look at how pairs of unusual greek forms appear together, or determine just how similar the same pericope is across the different synoptic accounts.

* Thirdly we can look at models of the texts. This involves determining which of a number of models underlying the texts is more likely. So we could build a model of the synoptics with a hidden source, (Q for example), and look at whether that model is more likely than one with no hidden source. This final step is unlike the previous two because we need to construct and justify our models before we begin. Math can't magically tell us history, but it can help us understand which of several well-defined historical possibilities might be more likely, given a set of explicit (and normally quantified) assumptions.

So I'm aiming for 3, but for a while all my results will be at level 1, mostly because I'm just getting used to the text and starting to write the code that I'm using to do these calculations. At some point between 2 and 3 I suspect I'll also need to learn a good deal more textual criticism, but then I'm in no rush to get anywhere with this.
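To give a flavour of what a level 2 measure might look like, here is a minimal sketch of one possible similarity score between two parallel passages: the Jaccard index of their sets of headwords. This is just one candidate measure that I am assuming for illustration, not a settled method, and the function name `jaccard` and the toy word lists are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two word lists: shared headwords / all headwords."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # Two empty passages are treated as identical (a convention).
    return len(set_a & set_b) / len(set_a | set_b)

# Toy data: single letters stand in for the headwords of two parallel pericopes.
passage_one = ["a", "b", "c", "d"]
passage_two = ["b", "c", "d", "e"]

similarity = jaccard(passage_one, passage_two)
print(round(similarity, 2))
```

A score of 1 means the two passages use exactly the same headwords, and 0 means they share none. It ignores word order and repetition, so it is a crude first pass at best.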

Tuesday 26 May 2009

Following on from my previous post, I got to thinking about a "mathematically optimal greek course". One that casts aside any concept of pedagogy and optimizes for a hypothetical learning machine.

We'd want to teach this machine its vocab in roughly decreasing order of frequency. So words that appear in the NT most often are taught first. We know from the last post that after about 200 words, it would know every word that appears 100 or more times. But what does this mean?

In particular, if we learned words this way, when would we be able to read our first NT verse? And after learning 200 words, how many verses would we be able to read without help?

We can calculate this very easily. Using our frequency ordered word list we can assign each verse in the bible a score: the position in the word list of the verse's rarest word. So a verse with no rare words will have a low score, and a verse containing a very rare word will score highly.
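The scoring can be sketched in a few lines of Python. This is only an illustration on toy data: the function names `frequency_ranks` and `verse_scores`, the dictionary input shape and the alphabetical tie-breaking between equally frequent words are all assumptions of mine.

```python
from collections import Counter

def frequency_ranks(verses):
    """Map each headword to its 1-based position in the frequency-ordered list."""
    counts = Counter(word for words in verses.values() for word in words)
    # Most frequent first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered, start=1)}

def verse_scores(verses):
    """Score each verse by the rank of its rarest headword."""
    ranks = frequency_ranks(verses)
    return {ref: max(ranks[w] for w in words) for ref, words in verses.items()}

# Toy data: single letters stand in for headwords.
toy_verses = {
    "v1": ["a", "b"],
    "v2": ["a", "c", "a"],
    "v3": ["a", "b", "d"],
}
toy_verse_scores = verse_scores(toy_verses)
print(toy_verse_scores)
```

Taking the maximum rank means a single rare word is enough to give a verse a high score, however common its other words are.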

It turns out the easiest verse to read, on this scoring system, is Matt 16:15 '"But what about you?" he asked. "Who do you say I am?"' (λέγει αὐτοῖς ὑμεῖς δὲ τίνα με λέγετε εἶναι). It scores 33. So its most uncommon word (τίνα, from τίς "who") is 33 on the frequency word list. Actually all the other words in the verse are in the top 10. Even in English you can see that most of them are very common (the verb λέγω "to say" is at number 9, but is the most common verb in the NT).

After Matt 16:15 comes 1 Cor 3:23 'as we are to Christ, Christ is to God' (ὑμεῖς δὲ Χριστοῦ Χριστὸς δὲ θεοῦ) with a score of 35. Then there's a small gap before we get several with scores in the 50s and 60s (some are just short, like John 10:30, others are surprisingly complex, like 1 Cor 8:6, with 27 greek words in the verse, but a score of just 50 - coming in joint 3rd overall).

Now, as in the previous post, I've ignored morphology, which skews these results. You couldn't just teach someone 33 words and have them read Matt 16:15. To get a proper curriculum for the hypothetical greek learning machine, we'd need to include grammar, and that is the subject for another post.

In case you're wondering at any point, I'm using the NA27 text, as marked up and published by James Tauber at his now defunct MorphGNT site (the source texts are still available at the site, however).

Monday 25 May 2009

So the first question I looked at (a little warm-up really) is: what is the frequency-ordered word-list of NT greek?

The motivation for this was partly so I could get a sense of how much I'd need to know for the different 'hint' levels at John Dyer's Readers Greek Bible. If you haven't seen the site, it allows you to get footnote hints for words that appear fewer than X times in the NT. I wanted to know roughly what my vocab size would need to be to read at each level.

The answers are:

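| Frequency | Number of Words |
|---|---|
| 100 | 171 |
| 90 | 194 |
| 80 | 212 |
| 70 | 238 |
| 60 | 271 |
| 50 | 310 |
| 45 | 338 |
| 40 | 377 |
| 35 | 416 |
| 30 | 463 |
| 25 | 545 |
| 20 | 636 |
| 15 | 809 |
| 10 | 1126 |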


This analysis ignores word forms (so all declensions of a noun count as the same word, for example). There are 5463 different words in the NT using this method of counting.
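A table like this could be produced with something along the following lines. It is a minimal sketch that assumes the text has already been reduced to a flat list of headwords (one entry per word in the NT), and it counts the words that occur at least X times. The function name `words_needed` and the toy data are hypothetical.

```python
from collections import Counter

def words_needed(lemmas, thresholds):
    """For each threshold X, count distinct headwords occurring at least X times."""
    counts = Counter(lemmas)
    return {x: sum(1 for c in counts.values() if c >= x) for x in thresholds}

# Toy data standing in for the headword of every word in the NT.
toy_lemmas = ["ὁ"] * 5 + ["καί"] * 3 + ["λέγω"] * 2 + ["τίς"]

toy_table = words_needed(toy_lemmas, [1, 2, 3, 5])
for frequency, number in toy_table.items():
    print(frequency, number)
```

With a threshold of 1 the count is simply the number of different words in the text.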

These numbers seemed very low to me. I think they make it seem simpler than it really is. With 200 greek words you would struggle to read even with help set at level 100, because you are likely to come across novel forms of words you know. And you'd have to know the words in their strict order of frequency, which nobody does. I'm pretty sure my greek vocab has more than a thousand words in it, but I struggle to read comfortably below level 30.

But still it is interesting.

One of the questions this raises for me is how much vocab you'd need to know to start reading the NT.

This blog is intended to hold notes and results from my dalliances with computational linguistics and the New Testament.

My aim at this point is to play around with the NT in order to practice my greek and to learn more about the text. There are a couple of areas I'm interested in:

First is learning more NT greek. I learned greek in my theology degree, and have kept up with it to a modest level. But I'm still a real amateur, so I'm interested in whether mathematical engagement with the text can help me master the language quicker or more fully.

Secondly I'm interested in the Synoptic Problem, and in particular whether there are analyses that can shed light on it. I think there are some techniques from bioinformatics that could directly illuminate the problem, but getting to the point where I can do that analysis would be tricky.

Thirdly I'm perpetually interested in visualization in lots of contexts. Visualization is time consuming, but if I get time it would be good to post some graphs, charts and other images here.