New MorphGNT Releases and Accentuation Analysis and more...



New MorphGNT Releases and Accentuation Analysis

Over the last few weeks, I’ve made a number of new releases of the MorphGNT SBLGNT analysis fixing some accentuation issues mostly in the normalization column. This came out of ongoing work on modelling accentuation (and, in particular, rules around clitics).

Back in 2015, I talked about Annotating the Normalization Column in MorphGNT. This post could almost be considered Part 2.

I recently went back to that work and made a fresh start on a new repo, gnt-accentuation, intended to explain the accentuation of each word in the GNT (and eventually other Greek texts). There are two parts to that: explaining why the normalized form is accented the way it is, and then explaining why the word-in-context might be accented differently (clitics, etc.). The repo will eventually do both, but I started with the latter.

My goal is for that repo to be part of the larger vision of an “executable grammar” I’ve talked about for years, where rules about, say, enclitics, are written up formally in a way that can be tested against the data. This means:

  • students reading a rule can immediately jump to real examples (or exceptions)
  • students confused by something in a text can immediately jump to rules explaining it
  • the correctness of the rules can be tested
  • errors in the text can be found
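As a very rough sketch of what I mean (the function name and data shapes here are hypothetical illustrations, not the actual gnt-accentuation code), one clause of the enclitic rules might be written up as something like:

def host_needs_extra_acute(host_accents):
    # One clause of the enclitic rules: a word accented on the antepenult
    # (e.g. ἄνθρωπος) must carry an additional acute on its ultima when
    # followed by an enclitic (ἄνθρωπός τις).
    # host_accents is a made-up data shape: a list of (syllable, accent) pairs.
    accented = [syllable for syllable, accent in host_accents if accent]
    if "antepenult" in accented:
        return "ultima" in accented
    return True  # other clauses cover the remaining cases

assert host_needs_extra_acute([("antepenult", "acute"), ("ultima", "acute")])

Run over every enclitic in the GNT, a check like this simultaneously tests the rule and flags possible errors in the text.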

It is the fourth point that led my recent work to uncover some accentuation issues in the SBLGNT text, normalization, and lemmatization. Some of these have been corrected in a series of new MorphGNT releases: 6.08, 6.09, and 6.10. See https://github.com/morphgnt/sblgnt/releases for specifics. The reason for so many releases is that I wanted to get corrections out as soon as I made them, but then I kept finding more issues!

There are some issues in the text itself which need to be resolved. See the GitHub issue https://github.com/morphgnt/sblgnt/issues/52 for details. I’d very much appreciate people’s input.

In the meantime, stay tuned for more progress on gnt-accentuation.


Diacritic Stacking in Skolar PE Fixed

Back in Polytonic Greek Unicode Still Isn’t Perfect and An Updated Solution to Polytonic Greek Unicode’s Problems I talked about problems with stacking vowel length and other diacritics. At least in terms of the font used on this site, the problems are now solved.

After discussions on the Unicode mailing list, it was clear that the solution to better handling of complex diacritic stacking in polytonic Greek was NOT more precomposed forms but better support in fonts, etc. So I reached out to David Březina, the creator of the Skolar typeface, used on this site, to see if the issues could be addressed.

I’m delighted to say that Březina’s foundry Rosetta Type has released new versions of Skolar PE that address all the issues I had.

I’ve now switched this site over to the new version, which does mean those old posts complaining about the issues will read a little funny, as they no longer show examples of the problems they purport to.

Thank you, David, for listening to my input and making my favourite Greek typeface even better!

UPDATE (2017-01-06): it turns out I also needed to add font-feature-settings: "ccmp"; for it to work in Safari.


First Pass of MorphGNT Verb Coverage and LXX Beginnings

In greek-inflexion and an Update on the Morphological Lexicon I said that all the verbs in the MorphGNT SBLGNT analysis should be done by the end of the year. I hit that goal and made a decent start on the Septuagint.

As mentioned in that previous post, by May 2016 I could generate every single verb form in:

  • Louise Pratt’s intermediate grammar
  • Helma Dik’s Greek verb handouts
  • Andrew Keller & Stephanie Russell’s beginner-intermediate text book

On December 8th, I’d actually finished coverage of all the verbs in the MorphGNT SBLGNT (with a little bit of help from Nathan Smith).

The stem database is available at https://github.com/jtauber/greek-inflexion/blob/morphgnt/morphgnt_lexicon.yaml. I should emphasize, though, that this is just a first pass and there’s more work to do, but the coverage is now there.

I immediately started work on applying the greek-inflexion code and stemming rules to the CATSS analysis of the LXX. By the end of 2016, I’d built a stem database and updated the stemming rules to cover the Pentateuch, 1 Maccabees, Jonah, Nahum, and Ezra-Nehemiah. Work on the rest of the CATSS analysis will continue over the next few months.

I decided to start a new stem database from scratch for the LXX (although I recently wrote a script to compare stem databases for inconsistencies). My primary reason was to see whether I ended up with the same analysis for a given verb stem, as a way of catching potential errors in my original MorphGNT analysis. The classical Greek exemplars listed above, the MorphGNT SBLGNT, and the LXX analysis all share the same stemming rules, though.
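A comparison along those lines might look roughly like the following. This is only a sketch, not the actual script: it assumes each database is a YAML mapping from lemma to a mapping of stem information, which may not match the real layout of morphgnt_lexicon.yaml, and lxx_lexicon.yaml is a made-up filename.

import yaml  # PyYAML

def compare_stem_databases(path_a, path_b):
    # Report lemmas present in both databases whose shared stem entries disagree.
    with open(path_a) as f:
        db_a = yaml.safe_load(f)
    with open(path_b) as f:
        db_b = yaml.safe_load(f)
    for lemma in sorted(set(db_a) & set(db_b)):
        for key in sorted(set(db_a[lemma]) & set(db_b[lemma])):
            if db_a[lemma][key] != db_b[lemma][key]:
                print(lemma, key, db_a[lemma][key], "!=", db_b[lemma][key])

compare_stem_databases("morphgnt_lexicon.yaml", "lxx_lexicon.yaml")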

My reasons for doing the stem analysis on the CATSS morphological analysis were threefold:

  • expand coverage of the stem database to more parts of existing verbs as well as to new verbs
  • provide broader tests for the stemming rules
  • prepare for a morphological analysis of the Swete text of the LXX/OG

A fourth benefit quickly emerged, though: I found errors in the CATSS analysis.

I’ve been maintaining patch files which, after a review pass, I’ll contribute back to CCAT (if they are interested). Fun fact: it was contributing corrections back to the CCAT’s GNT analysis which started me on the path to MorphGNT 24 years ago!

The patches are available at https://github.com/jtauber/greek-inflexion/tree/lxx/lxxmorph. They need to be reviewed because they pretty much assume the text is correct (including its accentuation, which prompted many of the corrections) and because I redid the analysis without considering context. An easy way to contribute would be to help review these patch files.

All this work on greek-inflexion has led to some improvements to the underlying inflexion library as well as numerous corrections to greek-accentuation.

Work on the LXX coverage will continue as well as expansion to other texts (both Hellenistic and Classical).

Also in an early stage is better modeling of stem formation and endings.

Finally, the fruits of all this will soon be applied to the online Greek reader I talked about at SBL 2016, with a goal to release a prototype for the Johannine gospel and epistles in a couple of months.


Polytonic Greek Unicode Still Isn’t Perfect

Whether we’re talking about fonts, programming languages, keyboard entry or even the command line, support for polytonic Greek has greatly improved even in the last 10 years, let alone over the 23 years I’ve been doing computational analysis of Greek texts.

UPDATE (2016-12-04): The Skolar examples in this post will no longer make sense as the issues have now been fixed. See Diacritic Stacking in Skolar PE Fixed.

With configurable input sources in OS X, it’s easy to type polytonic Greek and the default fonts support all the Unicode codepoints for polytonic Greek. I can now just type Greek (rather than a transliteration or BetaCode) in data files or forum posts or emails or tweets or GitHub issues. There are still some display issues with using polytonic Greek in fixed-width fonts but that’s improving. Last year I talked about the bug I reported that got fixed in the Atom editor.

Python has long supported Unicode and Python 3 made it even easier to deal with text processing of Unicode files. It doesn’t sort polytonic Greek correctly out of the box, but I wrote pyuca to solve that problem!
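For example, with pyuca’s Collator supplying the sort key (a minimal illustration with an arbitrary word list):

from pyuca import Collator

collator = Collator()
words = ["ἄνθρωπος", "ἀγάπη", "Ἰησοῦς", "ζωή"]
# sorts in proper Greek alphabetical order rather than by raw code point
print(sorted(words, key=collator.sort_key))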

The situation seemed almost perfect until I started doing a lot more work that required me to track vowel length and, in particular, to use a macron (ˉ) to distinguish long α, ι, and υ from short. It’s okay when the macron is the only diacritic on a vowel: the problems start when a vowel has both an acute and a macron. (There is no need for a macron with a circumflex, as the circumflex already implies the vowel is long. The same goes for a vowel with an iota subscript.)

Problem 1: No precomposed character code points

ᾱ can be written as the decomposed U+03B1 U+0304 or the precomposed U+1FB1:

>>> len('ᾱ')
1
>>> [hex(ord(ch)) for ch in 'ᾱ']
['0x1fb1'] 
>>> [unicodedata.name(ch) for ch in 'ᾱ']
['GREEK SMALL LETTER ALPHA WITH MACRON']
>>> unicodedata.decomposition('ᾱ')
'03B1 0304'

ά can be written as the decomposed U+03B1 U+0301 or the precomposed U+03AC (assuming normalization to a tonos which the Greek Polytonic Input Source on OS X does):

>>> len('ά')
1
>>> [hex(ord(ch)) for ch in 'ά']
['0x3ac']
>>> [unicodedata.name(ch) for ch in 'ά']
['GREEK SMALL LETTER ALPHA WITH TONOS']
>>> unicodedata.decomposition('ά')
'03B1 0301'

But there’s no precomposed character ᾱ́:

>>> len('ᾱ́')
2
>>> [hex(ord(ch)) for ch in 'ᾱ́']
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', 'ᾱ́')]
['0x1fb1', '0x301']

As you can see, even Python 3 views ᾱ́ as two characters. This also screws up font metrics in many text editors and browser text areas (like the one I’m writing this post in).

Problem 2: Many fonts with otherwise excellent polytonic Greek support don’t display it properly

The Skolar PE font I use on this site can’t properly display ᾱ́. It displays it as ᾱ́. Ironically this is one time the fixed width fonts do a better job!

Problem 3: You can’t normalize an alternative ordering of diacritics

If you already have a GREEK SMALL LETTER ALPHA WITH TONOS and you add a COMBINING MACRON you end up (at least in the fonts I’ve tried) with something that even visually looks different from the GREEK SMALL LETTER ALPHA WITH MACRON followed by COMBINING ACUTE ACCENT:

>>> "\u03ac\u0304"
'ά̄'

(Notice that ά̄ != ᾱ́ and oddly, Skolar PE does a better job of the former than the latter: ά̄ vs ᾱ́)

And to make matters worse, you can’t normalize one to the other:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03ac\u0304')]
['0x3ac', '0x304']

You have to combine the components in the correct order, with the macron FIRST:

>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0304\u0301')]
['0x1fb1', '0x301']
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', '\u03b1\u0301\u0304')]
['0x3ac', '0x304']

This is not a bug: technically ά̄ and ᾱ́ are distinct graphemes. But it’s still an annoyance, because it requires any code that adds diacritics to know the correct order in which to add them.
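One way around this when writing code that adds diacritics is to splice the macron in immediately after the base letter and let NFC recompose (just an illustration, not the greek-accentuation API):

>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFD', '\u03ac')   # ά
>>> [hex(ord(ch)) for ch in unicodedata.normalize('NFC', decomposed[0] + '\u0304' + decomposed[1:])]
['0x1fb1', '0x301']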

Problem 4: No support in the Greek Polytonic Input Source

The Greek Polytonic Input Source supports typing a digraph (diacritic then base) to produce precomposed characters, but you can’t use a trigraph to enter ᾱ́. In fact, every time I’ve needed to type ᾱ́ in this post, I’ve had to copy and paste it from an earlier usage (having manually minted one via Python the first time).
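The minting itself is just a matter of composing in the right order, e.g.:

>>> import unicodedata
>>> unicodedata.normalize('NFC', '\u03b1\u0304\u0301')
'ᾱ́'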

Problem 5: My existing syllabification heuristics didn’t work

I recently had to tweak the syllabification heuristics in my greek-accentuation Python library to correctly syllabify words like φῡ́ω. Prior to 0.9.4, it put a syllable division between the macron and the acute!

This would not have happened if Unicode (and hence Python) treated ῡ́ as a single character.
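The underlying issue is easy to see at the REPL: the macron-bearing vowel and its acute really are separate code points, so character-by-character logic sees the accent on its own.

>>> phuo = '\u03c6\u1fe1\u0301\u03c9'   # one spelling of φῡ́ω
>>> len(phuo)
4
>>> [hex(ord(ch)) for ch in phuo]
['0x3c6', '0x1fe1', '0x301', '0x3c9']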

Problem 6: There’s also breathing

I thought I was all set after fixing Problem 5 but then I hit the imperfect of ἵστημι which starts in most cases with ῑ́̔/ῑ̔́ (yes, that should be a rough breathing and acute with a macron.) I’m in the process of working around this problem in greek-accentuation now.
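For the record, the two orderings really are distinct code point sequences, which is why both spellings circulate:

>>> a = '\u1fd1\u0314\u0301'   # ῑ, rough breathing, acute
>>> b = '\u1fd1\u0301\u0314'   # ῑ, acute, rough breathing
>>> a == b
False
>>> [hex(ord(ch)) for ch in a]
['0x1fd1', '0x314', '0x301']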

The Solution

The root cause of all this is simply that Unicode-based code can’t treat ῑ́̔ or ῡ́ or ᾱ́ as single characters, because Unicode doesn’t have code points for the precomposed characters. I imagine it’s a long road to get the Unicode Consortium to “fix” this, if it’s even possible. And even if some future version of Unicode fixed it, I’d have to wait for Python and OS X to catch up before the problem really went away. For now I’ll just have to continue to work around the problem in code like my greek-accentuation library. That still doesn’t solve the problem with the Skolar PE fonts, but I might be able to raise that issue with the font foundry.

It’s possible there are additional workarounds or tricks I’m not aware of. If there are, please let me know.

CORRECTION: Thanks to Tom Gewecke for pointing out an earlier misstatement about the Polytonic Greek Input Source on OS X producing combining characters. It does not. It supports digraphs to produce precomposed characters.

CORRECTION: Thanks to Martin J. Dürst for pointing out that ά̄ and ᾱ́ are distinct graphemes and so the fact they aren’t normalized to each other isn’t a problem with Unicode as such.

UPDATE: I remarked at the end of Problem 1 on font metrics in editors and text areas, but really that should be a separate problem. Related (and perhaps yet another problem) is selecting characters with multiple diacritics.

Updated Solution

Now see my later post: An Updated Solution to Polytonic Greek Unicode’s Problems.


An Updated Solution to Polytonic Greek Unicode’s Problems

In Polytonic Greek Unicode Still Isn’t Perfect, I enumerated various challenges that still exist with using Polytonic Greek when vowel length needs to be marked. I now have a better appreciation of what solutions are actually realistic.

After discussions with people on the Unicode mailing list, it’s clear the solution is NOT to add more precomposed character code points to Unicode (or rather, such a solution will never be adopted by Unicode). Rather, the solution likely lies in the tools just understanding grapheme clusters. For more background, see Grapheme Cluster Boundaries in the Unicode Standard Annex on Unicode Text Segmentation.
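In Python, the third-party regex module already offers a rough approximation of this via \X, which matches an extended grapheme cluster (a quick illustration, assuming regex is installed):

>>> import regex
>>> regex.findall(r'\X', '\u03c6\u1fe1\u0301\u03c9')   # φῡ́ω
['φ', 'ῡ́', 'ω']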

Perl 6 already has support for this: a layer above code points representing what are considered single graphemes even if made up of multiple code points. See, for example, Jonathan Worthington’s slides on Normal Form Grapheme.

So my plan is, at the very least, to implement a similar approach for Python 3 (unless someone else already has). That will still mean the problem has to be solved separately by:

  • font foundries
  • text editor developers
  • keyboard / input source software developers
  • operating system developers

I’ll try to engage with each of these groups and will keep people posted on my progress.

Thanks to Ken Whistler for making clear that the path forward is not in more precomposed characters but in working with system vendors and font foundries.

Thanks to Markus Scherer and Elizabeth Mattijsen for their pointers to TR29 and the Perl 6 work.

UPDATE (2016-12-04): Now see Diacritic Stacking in Skolar PE Fixed.
