Marshall Bible
Project Management

Dave Utter, GFA missionary to the Marshall Islands, is in the process of publishing the Bible in Marshallese, using the modern Marshallese orthography. Victor Mote is attempting to assist in the typesetting for this project. Dave is the decision-maker on all issues; Vic's purpose is to relieve Dave from as much of the technical, management, and grunt work as is feasible, so that Dave can stay as focused as possible on the language issues. The purpose of this page is to document decisions, progress, and to-do items for the project:

Prototype
Major Challenges
Resolved Issues
Intractable Problems
Dave's To-Do List
Vic's To-Do List
Tokens for File Editing
Word-to-XML Conversion
Toolbox for Document Markup
Toolbox for Document Processing
DTD Documentation

Prototype

Believing that it is sometimes easier to show than to explain, here is a link to the current version of our prototype, Matu (Matthew): (PDF, 534 KB, 93 pages). Note that the combining diacriticals are now aligned more-or-less properly. However, there are still numerous layout issues, especially related to the relationship between side notes and footnotes.

Matthew was converted from InDesign. All other books are being converted from Word. Our prototype for the Word-to-XML conversion is Mark: (PDF, 345 KB, 55 pages).

Major Challenges

Major challenges on this project include:

General Bible layout issues. Bibles often include marginalia for cross-references to other Scripture passages and brief notes.
Marshallese issues. In addition to the Roman character set, Marshallese has 16 non-ASCII characters which must be encoded properly, and for which fonts must be obtained or created.
Platform differences. Dave uses Mac, Vic uses Linux and Windows. This presents some challenges for tools and support, although we think they are all resolvable.
Geographical distance. Dave is in the Marshall Islands, Vic is in Colorado Springs. The Internet helps, but Dave currently has access only to an expensive dialup service.

Resolved Issues

Use an XML to XSL-FO to PDF workflow. This decision is tentative, but seems to be superior to any others.
Use standard Unicode encoding of Marshallese characters, where possible. For characters with no Unicode code points, we are currently using code points in one of the private-use sections, starting at U+F000. The "correct" way to do this is to use the combining diacriticals to compose the missing characters, but all of the application software seems unable to do the combinations correctly. This may change before publication. The importance of using the correct encoding is primarily for those cutting and pasting, or searching the electronic version of the document. It may also be important if we try to build indexes down the road. For now, the non-Unicode characters will be encoded in our documents as XML entities. See "Setting Type in Marshallese" (on EO's web site) for a fuller discussion of these issues.
Marginalia should be in a column on the inside of the page (left for recto pages, right for verso pages).
Work around Intractable Problem #1 in one of two ways: 1) allow the marginalia region to intrude into the footnote region, or 2) place all marginalia on the right side of each page (regardless of recto/verso). The final decision on which of these two workarounds to use will be made by Dave, if necessary, prior to publication.
We will optimize the output PDF for print use (instead of on-screen use). We'll add as many electronic features as possible, but, where there are conflicts, we will choose the option that works best on paper.
For now, we will not attempt to share a CVS or Subversion repository. Vic will continue to use one on his local machine, but we won't try to give Dave access to it. After we get going, our workflow should be simple enough that emailing files back and forth and trading tokens should be sufficient. If we need to, we will change this, but it will be an added layer of complexity for Dave that Vic would like to avoid for now. (If we do change, Vic needs to either rent server space or get static IP address(es) to his internal machine.)
For now, we will continue to actually build the output in Colorado Springs. (If needed, we can change this. Doing so would require Dave to license the XEP product, approx. $300, plus training, converting scripts, etc.)
For now, place OT quotations inside an <Emphasis2> tag. (Other options considered were 1) adding an <InlineSpecial Variant="OTQuote"> element, 2) adding an <OTQuote> element.) If we need the additional information, we can convert to a different scheme later.
For now, we will not use old-style figures. After comparison with other Bibles, it looks like it would be too distracting.
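The encoding decision above (use a standard Unicode code point where one exists, otherwise fall back to a private-use code point starting at U+F000) can be sketched roughly as follows. This is only an illustration, not the project's actual conversion macro: the four base-plus-diacritic pairs are assumed examples rather than the project's real list of 16 characters, and the function name is hypothetical.

```python
import unicodedata

# Assumption: illustrative base-letter + combining-mark pairs only.
# U+0327 is COMBINING CEDILLA, U+0304 is COMBINING MACRON.
CANDIDATES = [("L", "\u0327"), ("M", "\u0327"), ("N", "\u0304"), ("O", "\u0304")]

PUA_START = 0xF000  # first private-use code point mentioned in the text


def assign_code_points(pairs, pua_start=PUA_START):
    """Map each base+mark sequence to a single code point.

    If NFC normalization composes the pair into one character, that standard
    code point is used. Otherwise the next private-use code point is assigned.
    """
    mapping = {}
    next_pua = pua_start
    for base, mark in pairs:
        sequence = base + mark
        composed = unicodedata.normalize("NFC", sequence)
        if len(composed) == 1:
            mapping[sequence] = ord(composed)
        else:
            mapping[sequence] = next_pua
            next_pua += 1
    return mapping


if __name__ == "__main__":
    for sequence, code_point in assign_code_points(CANDIDATES).items():
        print(f"{sequence[0]} + U+{ord(sequence[1]):04X} -> U+{code_point:04X}")
```

In this sketch, L with cedilla resolves to the standard code point U+013B, while M with cedilla has no precomposed character and so receives the first private-use value.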

Intractable Problems

The following are issues for which there appears to be no current solution, based on the parameters in the Resolved Issues section:

The current solution for placing marginalia on the inside of each page allows (or rather forces) the marginalia region of the page to intrude into the footnote region. We want the opposite to occur, but current limitations of XSL-FO make this impossible. There are several ways this issue may be resolved before masters are created. Vic has proposed changes to the standard that may eventually provide a solution.
The current solution for placing marginalia on the inside of each page causes some clearance problems with other float items on the page, specifically drop-caps. In some rare and usually insignificant cases, the marginalia notes are lower on the page than we would desire because they are forced to be clear (horizontally) from the drop-caps. This may affect us with other floating items also.
The current solution for placing marginalia on the inside of each page causes a small, extra, unneeded indentation on the outside of each marginalia item. This corresponds to the space that is between the body text and the marginalia (there is no way to tell it to indent just the inside, so we must indent both).

Dave's To-Do List

Review the section Tokens for File Editing and either 1) let Vic know that he understands the purpose of this, or 2) ask Vic to clarify what is going on here.
Send Vic table containing the 3-character abbreviations for each book, so that he can convert the Scripture references to this format.
Consider whether sidenotes should be granularized at the verse level (as is currently done), or whether they should be tied to a specific spot in the text. (Actually, as I think about it, the status quo is probably a result of the InDesign limitations -- perhaps your intent was to tie them to a specific place in the text). The Scofield Bibles use lowercase italic letters as references to specific spots in the text, and I think we could do something similar, with numbering starting over each chapter. This would be more space-efficient than placing the chapter:verse notation within a note itself, but might require extra space between notes, as it could result in more notes, or perhaps (like the Scofield Bible) the text is indented relative to the numbering. Important: Even if we want all of the sidenotes to be grouped together for a verse, if we have a specific place in the verse that the note belongs with, we could still mark it up that way. Vic can then glue all of the <Sidenote> elements in the verse together for output purposes. This would mean that, if you decided later on to allow multiple sidenotes in each verse, each with its own reference mark, this could be accomplished with only a stylesheet change. In other words, if the notes are location-specific, we could (should??) preserve that information even if we don't intend to use it right now.
Check the Mark PDF for correct character conversion. Specifically find at least one example (if possible) of each of the 16 non-ASCII characters, and make sure that it was converted correctly.
Resolve issue of whether all 66 books are in one document, or 66 separate documents, or both. Consider both print and on-line versions. For online, we might want to use 66 separate documents, but here are the issues:
- if generated separately, the page numbers will almost certainly have to start at 1 for each one, making them not match the printed text (there are some manual workarounds to this)
- if generated as one document, then split into 66 documents, links will be messed up; also the resulting files may be bloated
Resolve verse/paragraph formatting issue. There are at least two possible ways of formatting verses and paragraphs in Bibles:
- Most Bibles format each verse as a separate paragraph, and add a pilcrow sign (paragraph symbol) where a true paragraph break occurs.
- The other option is to format true paragraphs as paragraphs, and simply show the verse number as an in-line notation within the paragraph.
Because true paragraphs and verses (or even chapters) can be staggered, this is the one place where our output may affect the design of our semantic XML document. If we use option 2 above, we will need to create specialized "fragment" elements to handle straddling issues. If we use option 1, we can get away with simply dropping a pilcrow character directly in the text, and otherwise ignoring "true" paragraphs.
Decide which typeface family (or families) to use. For a professional look, the base typeface should include true small-caps variants, and the font used in the page heading (which can be the same as the body text) should probably include old-style figures. The prototype uses Adobe Caslon Pro as the base, but there are other possibilities as well. For portability between Mac and Windows (among many other issues), the font should be available in OpenType format.
Decide which font sizes should be used. The Portage defaults (and the current prototypes) are 11 pts for body text, 9 pts for footnotes & marginalia. The InDesign document for Matthew uses 10 and 8. The New Scofield Bible appears to use about 8.5 pts and 6.5 pts respectively. It uses approximately the same page size that we are using (8.5 x 5.5), but uses a 2-column format. If we used a point size as small as theirs in a one-column format, the lines would be too long.
Sign off on Vic's proposal: All Scripture references should be encoded in a standard format, e.g. "01-001-001" or "Gen-1-1" or "Genesis-1-1" for Genesis 1:1. However, whichever scheme is chosen should be used consistently (i.e., if "Gen" is chosen, "Gns" or "Ge" or even "Genesis" should not be used). This facilitates programmatic manipulation of the coding to handle cross-references, hyperlinks, page headings, etc.
Check:
- Footnote at end of Matt. 25:30 -- can't tell whether it should be there or at beginning of following verse
Matthew has some missing verses: 9:13, 17:21, 23:14
What scheme should we use for hyphenation? We can create the patterns for the processor to use. Does Dave know what the rules are? If there is a good dictionary, Vic can set up a mostly empty patterns file that Dave can make additions to as needed.
Consider converting hyphens used to designate ranges to en-dashes.
Consider looking for other hyphens that should be converted to em-dashes (I see some in Matthew footnotes).
Consider moving the location of footnotes in the body text from the beginning of a word to the end.
Consider changing "A.D." in the Matthew preface to "a.d." (small caps). Also, when used in English anyway, it belongs at the beginning of the date reference, as in "in the year of our Lord, 2004."
Sidenote for Matu 2:5, on page 4 of InDesign document, comes between Matu 2:6 and Matu 2:11. I placed it with Matu 2:9, which is where it visually appeared.
Dave to check the capitalization in the following Matthew passages, which Vic converted from uppercase: 1:23, 2:6, 2:15, 2:18, 3:3, 4:4, 4:6, 4:7, 4:10, 27:37 (since Vic doesn't know the language, he is not sure which words should be capitalized)
Let Vic know whether you are licensed for the Adobe Caslon Pro typefaces (Regular, Italic, Semibold and Semibold Italic) so he can send you the modified fonts. (These are part of the Adobe Type Basics OpenType Edition package).
Vic removed the next-to-last character in the second word (Jisos) of Matu 21:12, which was a "reverse line feed" character. Make sure the text of this verse is correct.
Dave to check text insets, which Vic has placed as tables in the content.
Do you want page numbers on each page, perhaps at the bottom?
Vic added drop caps for the chapter title, as a convenient way to allow a heading at the beginning of the chapter. Do you want this? We can follow the design used in the InDesign document, perhaps by placing the "Jepta 1" text and the heading in a table. These changes can all be done by the stylesheet. Just let Vic know what are the general rules for how the start of a chapter should look.
Although the character encoding issues are mostly resolved, we might still want to address the issue of the appearance of the glyphs, especially the mark under the L, M, N, and O characters. We are encoding them to characters that are described in the Unicode standard as containing a cedilla. However, the Adobe Caslon typeface we are using draws the glyphs as what appears to be more like a comma, and we have remained consistent with that in the characters/glyphs that we have added. The Tobin book uses a true cedilla. We can easily change this in the font itself without affecting any of the underlying document after we decide how they should look.
Footnote that Vic placed in Matu 12:30 was marked to be in Matu 12:38 in the InDesign document, but was in the logical position to go with the Footnote container in Matu 12:30 (i.e. it was between 12:24 and 12:31). Make sure it is in the right place.
Footnote that Vic placed in Matu 13:16 was marked to be in Matu 13:15 in the InDesign document, but logically fit the position in which Vic placed it. Make sure it is in the right place.
Ditto for footnote placed in Matu 13:44, marked for 13:42.
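For the character-conversion check in the list above (finding at least one example of each non-ASCII character), a small inventory script may save some hunting. The following is a minimal sketch that counts every non-ASCII character in a piece of text. The function names are hypothetical, and it assumes the text has already been read from the UTF-8 XML file rather than from the PDF.

```python
from collections import Counter
import unicodedata


def non_ascii_inventory(text):
    """Count each character in the text whose code point is above 127."""
    return Counter(ch for ch in text if ord(ch) > 127)


def describe(counter):
    """Return one report line per character, sorted by code point."""
    lines = []
    for ch, count in sorted(counter.items()):
        # Private-use code points have no Unicode name, so supply a default.
        name = unicodedata.name(ch, "<no name; possibly private use>")
        lines.append(f"U+{ord(ch):04X} {name}: {count}")
    return lines


if __name__ == "__main__":
    sample = "a\u0101\u0101\uf000"  # made-up sample, not project text
    for line in describe(non_ascii_inventory(sample)):
        print(line)
```

Any character that shows up with a count of zero (that is, does not appear in the report at all) would need to be checked in another book.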

Vic's To-Do List

Get the Matu token back to Dave so that he can continue working on the text:
- (wait on Dave for InDesign document) Complete transfer of remaining Matu footnotes, sidenotes, and other content from the InDesign document to the XML document.
- Send the XML file to Dave & give him the token.
- Get Dave set up with an editor and trained on editing.
Find solution for pretty-printing?
Test the concept of using 3-character book abbreviations (e.g. "Gen") and full book names (e.g. "Genesis") in the semantic XML. Make sure that we can map these back to an internally normal form (i.e., "01" for Genesis and "40" for Matthew) or that we can use them as-is within the stylesheets.
(wait on Dave) Create appropriate OpenType fonts for typefaces that Dave chooses. Include at least regular, bold, and italic fonts. Don't forget to create small-caps glyphs for the non-Unicode characters.
Consider adding a "Special" attribute to the <Link> element to indicate a special normalized encoding such as the proposed scheme for representing Scripture references.
Build logic to optionally tie the various books together into one document.
It looks like the private-use range I picked for the non-Unicode characters is actually used by Courier New for standard Roman ligatures (not sure why this should be as there are ranges for that elsewhere). Consider picking a different range.
(wait on Dave) After Dave approves the Mark conversion, obtain and convert the remaining books to XML.
Probable rendering bug -- xsl-before-float-separator not showing up
Probable rendering bug -- verse immediately following a multi-page table does not have text-indent working -- probably picking up some attributes from the table??
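The to-do item above about mapping book abbreviations and full names back to an internally normal form could be tested with something like the sketch below. It is an assumption-laden illustration: the lookup table is deliberately partial and contains only names that appear on this page (the real 3-character abbreviations are still to come from Dave), and the function name and the hyphen-separated input format are taken from the proposed "Gen-1-1" style rather than from any finished design.

```python
import re

# Assumption: partial, illustrative table. The real abbreviation list
# is to be supplied by Dave.
BOOK_NUMBERS = {"Gen": 1, "Genesis": 1, "Exodus": 2, "Matthew": 40}

# Book (name or number), chapter, verse, separated by hyphens.
REFERENCE = re.compile(r"^([A-Za-z]+|\d{1,2})-(\d{1,3})-(\d{1,3})$")


def normalize_reference(ref):
    """Convert 'Gen-1-1', 'Genesis-1-1' or '01-001-001' to '01-001-001'."""
    match = REFERENCE.match(ref)
    if match is None:
        raise ValueError(f"unrecognized reference: {ref!r}")
    book, chapter, verse = match.groups()
    if book.isdigit():
        number = int(book)
    elif book in BOOK_NUMBERS:
        number = BOOK_NUMBERS[book]
    else:
        raise ValueError(f"unknown book: {book!r}")
    return f"{number:02d}-{int(chapter):03d}-{int(verse):03d}"


if __name__ == "__main__":
    for ref in ("Gen-1-1", "Genesis-1-1", "01-001-001", "Matthew-25-30"):
        print(ref, "->", normalize_reference(ref))
```

Rejecting unknown book names outright, rather than guessing, fits the goal of using one scheme consistently: a stray "Gns" or "Ge" would be flagged instead of silently passed through.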

Tokens for File Editing

In lieu of giving Dave access to source code control tools, we are currently emailing files back and forth as needed. Because we both have copies of the files, and because we are not using shared source code control, it is important to keep track of who "owns" a given file at any given time. Only the party holding the token for a file should edit that file. Otherwise, we will end up overwriting each other's changes. When passing a token to the other party, each party should explicitly state in an email message that they are doing so. For example, when Vic is done making changes to Matu.xml and emails it back to Dave, he should also indicate in the email that he is passing the token to Dave.

After we get going, this shouldn't be a big issue, because, except for major DTD changes or other systemic changes (hopefully rare), Vic's involvement with the content files should end after the Word-to-XML conversion has been completed.

As stated in the Resolved Issues section, if this gets to be too much trouble, we can use CVS instead.

The current tokens are:

File	Token Currently Belongs To
Matu	Dave
Mark	Dave
All Others	Dave

Word-to-XML Conversion

Fortunately, the Word files we wish to convert have very little formatting or markup content in them. Our main task then is to get the text dumped, character conversions done, and basic tagging. Here is the cookbook method of converting the Word files sent by Dave into XML:

Open the file in Word (instructions below are for Word 2003).
"Save As" Plain Text, with the following options:
- "Text encoding": Unicode (UTF-8)
- "Insert line breaks": Turn OFF (this will put the entire content of each verse on one line, making some of the later cleanup easier)
- "End lines with": LF only (use Unix line endings for consistency)
- "Allow character substitution": Turn OFF
Rename the saved file with extension ".xml" instead of ".txt".
Check the file in to CVS.
Open the xml file in UltraEdit32, and run the marshall3 macro. This will convert all characters to their appropriate Unicode code points.
Check in to CVS.
Copy and paste the XML and DOCTYPE declaration, and the assorted tags at the beginning and end of the file (basically everything except the chapter content).
Check in to CVS.
Adjust the book number value in the marshall4 macro (or be prepared to search and replace after running it).
Run the marshall4 macro. This will tag the <Chapter> elements.
Put all <Chapter> "id" attributes in the normal form, i.e. "c99-999".
Check in to CVS.
Adjust the book number value in the marshall5 macro (or be prepared to search and replace after running it).
Run the marshall5 macro. This will tag the <Para> elements.
Put all <Para> "id" attributes in the normal form, i.e. "v99-999-999". The book and verse portions should already be substantially correct. The chapter portion can be very quickly adjusted by using UltraEdit's "Column Mode", selecting the columns to change, and entering the correct value. A whole chapter's verses can be quickly changed in this way.
Use an XML validator to validate the document. Make corrections as necessary.
Check in to CVS.
Run macros PrettyPrint and PrettyPrint_2 to break content into 80-column lines.
That should be about it. At this point it is ready to be turned back over to Dave for him to add his markup.
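The "normal form" ids used in the steps above ("c99-999" for chapters, "v99-999-999" for verses) are simple zero-padded numbers, so they can be generated mechanically. The sketch below is not a replacement for the marshall4 and marshall5 macros, whose contents are not shown here. It only illustrates the id format and the basic tagging, with hypothetical function names and made-up verse text.

```python
from xml.sax.saxutils import escape


def chapter_id(book, chapter):
    """Build a chapter id such as 'c41-001' (book 41, chapter 1)."""
    return f"c{book:02d}-{chapter:03d}"


def verse_id(book, chapter, verse):
    """Build a verse id such as 'v41-001-003'."""
    return f"v{book:02d}-{chapter:03d}-{verse:03d}"


def tag_chapter(book, chapter, verses):
    """Wrap (verse_number, text) pairs in <Chapter> and <Para> tags.

    Verse numbers are passed in explicitly rather than counted, because
    a text may legitimately skip verses.
    """
    lines = [f'<Chapter id="{chapter_id(book, chapter)}">']
    for number, text in verses:
        # escape() protects &, < and > in the verse text.
        lines.append(
            f'<Para id="{verse_id(book, chapter, number)}">{escape(text)}</Para>'
        )
    lines.append("</Chapter>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(tag_chapter(41, 1, [(1, "First verse text"), (2, "Second verse text")]))
```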

Toolbox for Document Markup

This is Dave's toolkit. Essential items are as follows:

A Text Editor. It must be Unicode-aware, and should be able to read and write UTF-8 files. I am not familiar with Mac editors, but it should not be hard to find a free or inexpensive editor for this purpose. Eclipse might work. Also, the basic editor that ships with Mac OS might be fine. The editor should be able to save files with Unix line-endings (although another tool could be used to do this also).
A suitable font, preferably fixed-width, that has all of the Unicode characters needed for Marshallese, and which the text editor can use for editing. (The Courier New font that ships with Windows XP is adequate). The non-Unicode characters should be entered as XML entities for now. (Dave: the fonts that you have been using until now won't work because they are not using the Unicode codepoints for the Marshallese characters).
A copy of the DTD(s) used.
An XML validation tool. We can probably use Apache Xerces (free, open source) for this, since it is written in Java. There are other options as well.
Documentation of how to use the appropriate DTD(s).
If we set up a CVS repository for the content, each user will need a CVS client. See the MacCVS section at WinCVS. Also, we may need SSH for security. There are some free clients for Windows. I don't know about Mac. These items are not required until (and if) we decide to set up a repository.

Nice-to-have items:

A true XML editor, with validation, context-awareness, etc. (similar to XMLSpy in the Windows world) OR an application like Authentic that allows editing of XML documents in a browser-like interface.
O'Reilly's "Learning XML, Second Edition".

Toolbox for Document Processing

This is Vic's toolkit. He is documenting it here for two reasons: 1) if, for any reason, Vic is unable to complete the project, Dave will be able to continue, and 2) to provide the scope of what might be required for Dave to bring the project in-house if he decides to do that in the future.

Essential items include all of the Markup Toolkit, including nice-to-have items, plus:

An XSL-FO processor: XEP, FOP, Antenna House. We are currently using XEP.
(Optional) A copy of the Unicode standard (3.0 is sufficient). I also found Richard Gillam's book Unicode Demystified to be very helpful. (This is optional because I think I have already addressed most of the Unicode-related issues.)
A copy of the XSL-FO standard. I also use O'Reilly's XSL-FO.
An XSLT processor. We are using the free Apache Xalan.
XSLT reference books. (Needed for writing the stylesheets). I rely mostly on O'Reilly's XSLT and to a much lesser extent XSLT Cookbook and XPath and XPointer.
XML reference books. I rely mostly on O'Reilly's XML in a Nutshell.
Scripts to invoke the XSLT and XSL-FO processors to do the build. We are using Bourne-shell scripts on Linux. (Freely available from Vic)
A copy of the stylesheets used for formatting. (Freely available from Vic)
Font editing software. We are using FontLab, VOLT, the Adobe FDK, Microsoft Font Validator, and TTX. Believe it or not, we need all of these. However, because the layout software doesn't use the OpenType features, we can get by with only FontLab or something similar. The workarounds resulting from not using the OpenType features show up in other places, mostly in improper encoding, and in the stylesheets.
(possibly) Either a public CVS repository, or static IP addresses to access our internal repository.

DTD Documentation

This project is currently using the PortageBook DTD, developed by Vic for use by Portage Publications. We will change that if necessary, but for now it works fine.

Important: It is very possible that the DTD will need to be revised during the project. If you think a change is necessary, please contact Vic.

Much can probably be inferred by examining any existing file that has some markup in it. The following is a brief description of the elements that Dave is most likely to use in his markup work. The list is incomplete, but can be expanded later as needed:

<Para> (Type: Block) When used within a <Chapter>, use one <Para> for each verse.
- Attribute "id" marks the verse's full reference in the form nn-nnn-nnn, where the first set of digits represents the book, the second set represents the chapter, and the third set represents the verse. So, for example, Exodus 4:13 will have id="02-004-013" (the "02" indicating that Exodus is the 2nd book).
- All <Para> tags should be created during the Word-to-XML creation, so this shouldn't need to be used often.
Also used in other places in the document, most notably within any <Preface>, and within <Footnote> elements. When used in any context other than <Chapter>, the "id" attribute is generally not needed.
<Footnote> (Type: Inline) Use one <Footnote> for each footnote in the text. Each <Footnote> should contain 1 or more <Para> elements, which contain the text and any inlines needed.
<Sidenote> (Type: Inline) Use one <Sidenote> element for each cross-referenced marginalia item or group of items (at Dave's direction). Do not put the current verse number in the <Sidenote> content, as the stylesheet will do that automatically, if needed.
<Emphasis> (Type: Inline) Text within an <Emphasis> tag will be italicized in the output.
<Emphasis2> (Type: Inline) Text within an <Emphasis2> tag will be in small caps in the output. Use this for Old Testament quotations.
<Strong> (Type: Inline) Text within a <Strong> tag will be bolded in the output.
<Era> Use this for designating b.c. or a.d.
<Head1> (Type: Block) Use this for the outline-type notes that are placed in the text at the beginning of chapters or between verses.
<Comment> (Type: Inline) Use this for things that you do not want to appear in the output.
<LineBreak/> (Type: Inline) Use this for forcing a new line in the content.

Some notes for markup:

Place OT quotations in an <Emphasis2> tag. Do not put them in all caps, but instead put them in with normal case. The stylesheet will do the work of font selection and any conversion work that is required to get the output correct.
Do not put the verse number in the text of the verse. The stylesheet will do this if necessary.
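To make the element descriptions above more concrete, here is a small, hypothetical fragment using several of them, wrapped in a short script that parses it and extracts only the text that would appear in the verse body. The fragment is invented for illustration: the verse text is placeholder English, the exact nesting allowed by the PortageBook DTD is assumed rather than checked, and the "c" and "v" id prefixes follow the conversion cookbook rather than the digits-only form shown in the <Para> description.

```python
import xml.etree.ElementTree as ET

# Assumption: invented sample markup, not validated against the PortageBook DTD.
SAMPLE = (
    '<Chapter id="c40-005">'
    "<Head1>Placeholder outline heading</Head1>"
    '<Para id="v40-005-003">Verse text with <Emphasis>emphasis</Emphasis>'
    "<Footnote><Para>Footnote text.</Para></Footnote>"
    " and <Emphasis2>an Old Testament quotation</Emphasis2>."
    "<Sidenote>Cross-reference text</Sidenote>"
    "<Comment>Not for output</Comment></Para>"
    "</Chapter>"
)

# Elements whose content does not belong in the running verse text.
SKIPPED = {"Footnote", "Sidenote", "Comment"}


def visible_text(elem):
    """Return the text of elem, leaving out footnotes, sidenotes and comments."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in SKIPPED:
            parts.append(visible_text(child))
        # Text following a child element always belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    chapter = ET.fromstring(SAMPLE)
    for para in chapter.findall("Para"):
        print(para.get("id"), "->", visible_text(para))
```

Note that the verse number and the small-caps treatment of the quotation are absent from the markup itself, in keeping with the notes above: both are left to the stylesheet.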
