Ticket #55 (closed defect: fixed)

Opened 20 months ago

Last modified 19 months ago

Greek Text Lost in Conversion

Reported by: rast
Owned by: jsosin
Priority: critical
Milestone: First Static Release
Component: ddb xml
Version:
Keywords:
Cc: zau

Description (last modified by rast)

Greek letters that were present in the SGML have disappeared from the XML/HTML.

A common factor observed in all cases is the presence of dotted letters on _both_ sides of white space. When converted, any dotted letters directly to the right of the white space have fallen out. I imagine that this will have affected thousands of words.

PSI 7.804.5

b?a?l? has dropped out of the XML after [sio]u? and before [aneio]u? at the beginning of line 5. It is present in the SGML.

More examples of letters absent from XML (present in SGML):

P.Koeln IV 202, line 8 -- e? has fallen out before mbo-
http://apptest.cul.columbia.edu:8082/ddbdp/html?identifier=oai:papyri.info:identifiers:ddbdp:0144:4:202

P.Koeln IV 196 recto, line 4: M?e?s? has fallen out before ore
http://apptest.cul.columbia.edu:8082/ddbdp/html?identifier=oai:papyri.info:identifiers:ddbdp:0144:4:196

P.Koeln IV 192, line 1: N? has fallen out before [ .]s?mar[ .]o?s
http://apptest.cul.columbia.edu:8082/ddbdp/html?identifier=oai:papyri.info:identifiers:ddbdp:0144:4:192

Change History

Changed 20 months ago by jcowey

Worth trying to analyse what may have gone wrong, so that we can run a regex or two in case similar types of markup in the SGML have also been lost.
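A minimal sketch of such a regex, based only on the pattern described in this ticket (dotted letters on both sides of white space, with those to the right dropping out). It assumes that a dotted letter is written as a plain Latin letter followed by "?", as in the examples above; the real SGML encoding may use other characters, so the character class is an assumption, and the name find_at_risk is hypothetical.

```python
import re
from typing import List

# Assumption: a dotted letter is a Latin letter followed by "?",
# as in the ticket's examples (e.g. "b?a?l?").
# The lookbehind requires a dotted letter immediately left of the white space;
# the captured group is the run of dotted letters immediately to its right.
AT_RISK = re.compile(r"(?<=[A-Za-z]\?)\s+((?:[A-Za-z]\?)+)")


def find_at_risk(line: str) -> List[str]:
    """Return runs of dotted letters that directly follow white space
    which is itself directly preceded by a dotted letter."""
    return AT_RISK.findall(line)


if __name__ == "__main__":
    # Line 5 of PSI 7.804 as quoted in the description.
    print(find_at_risk("[sio]u? b?a?l? [aneio]u?"))  # ['b?a?l?']
```

Running this over the SGML would only list candidate spots; each hit would still need to be compared against the converted XML to see whether the letters were actually lost.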

Changed 20 months ago by rast

  • description modified

Changed 20 months ago by rast

  • description modified

Changed 20 months ago by rast

  • priority changed from major to critical
  • description modified

Changed 20 months ago by jcowey

Message from Uri Yitach via Skype. "please have a look at bgu 1155 at the end of line 18. I think that the online ddb misses a me- at the end."
This is similar to what has been described above.

See now #SGML2EpiDocUseCase

Changed 20 months ago by jcowey

See now SGML2EpiDocUseCase

Changed 20 months ago by thomase

  • status changed from new to accepted

Changed 20 months ago by thomase

  • cc zau added
  • owner changed from thomase to gbodard
  • status changed from accepted to assigned
  • milestone set to First Static Release

Changed 20 months ago by gbodard

  • owner changed from gbodard to zau

We've looked at this and created a new canary file for the five instances James listed in this ticket (see http://epiduke.cch.kcl.ac.uk/canaries/canaries-2008-12-09/). We have also ascertained that the problem arises not in the CHETC-regex phase but in the CHETC-cleanup-XSLT phase. Assigning to ZA for further investigation.

Changed 20 months ago by gbodard

Further, we have confirmed that the error appeared sometime between 2008-08-07 and 2008-08-11.

Changed 20 months ago by zau

  • owner changed from zau to jsosin

Josh, we think we've fixed this but could do with some real examples to make sure. Thanks!

Changed 20 months ago by gbodard

More examples provided by James (in duplicate ticket #68):

Re ticket:55: adds texts to the canaries list. Cf. the bottom of the SGML2EpiDocUseCase page.

  • bgu.3.887: affects lines 11, 16, 18, 23, 26, 30
  • bgu.3.918: affects lines 5, 11, 14, 21, 22, 28
  • bgu.14.2396: affects lines 1, 2, 5, 7, 9, 10
  • bgu.18.1.2756: affects lines 3, 7, 8, 13, 14, 15, 16, 20, 21
  • chla.11.465: affects lines 1, 6, 10, 11, 24
  • cpr.7.18: affects lines 3, 4, 7, 10, 15, 17
  • cpr.10.127: affects lines 1, 5, 6, 7, 9, 23
  • p.ant.3.198: affects lines 1, 3, 5, 6
  • p.col.7.170: affects lines 2, 3, 4, 6, 8, 14, 15, 22
  • p.diosk.4: affects lines 1, 6, 8, 17, 19, 22
  • p.gen.2.1.35: affects lines 4, 8, 11, 12, 15
  • p.harrauer.52: affects lines 2, 4, 5, 9, 14

Changed 19 months ago by gbodard

  • status changed from assigned to closed
  • resolution set to fixed

All fixed, checked in canaries, and signed off by JMSC and JFS. Closing ticket pending new run today (Dec 22).
