Re: <div> or <seg> or ...

m.margolin＠utoronto.ca

Nov. 9, 2004

5:54 p.m.

Hi everybody,

1. <div> vs. <seg>

My suggestion to use <seg> for encoding diplomatic parts is based solely on the TEI definition, which states that <seg> marks a text fragment, and that is exactly what a diplomatic part is. <div>, on the other hand, is more generic by definition and may contain some metadata along with the text. I think we should always seek a balance between the specific and the generic. The price of being too generic would be a performance penalty on any kind of information retrieval. On the other hand, any attempt to enumerate the content of the data (for example, by using precisely named elements inside <tenor>) can make the encoding too restrictive and essentially inapplicable. Therefore I suggest using <div type="document"> instead of <document>, keeping <tenor> because of its unambiguous meaning, and using <seg> to encode any diplomatic part at any level.

2. Cartulary and Document

In my understanding, the subject of the XML encoding is a medieval charter, which might belong to one or more cartularies. From the implementation point of view it would be unwise to encode multiple charters (a cartulary) in one text file in which some special elements (like <div>) mark the boundaries of each charter. Therefore the encoding of a given charter should include only references to its parent cartularies. The common approach to implementing a repository of charters is to create an independent database (or file system) entry for each charter.

3. Georg Vogeler wrote:

a) << Maybe Michael Gervers or Michael Margolin could give us an example where an alteration of the CID definition of diplomatic parts might be necessary?

The main reason for suggesting a generic approach to diplomatic encoding is that it is simply not possible to predict all content variations of real-world documents (for example, we currently distinguish between "standard" and "specific" parts of the <protocol> child elements). It is also very difficult to predict the depth of encoding appropriate for a given group of charters (for example, there may be more than two levels of subdivision of diplomatic parts).

b) << I'm working here with a Perl class on the possibilities of such a search engine - hoping to present you some results at the end of the term. That would mean that we have to stick to the terminology of the CID as long as it gives us an appropriate term.

From the point of view of information retrieval, I see no technical problem when the name of a diplomatic part is stored in an attribute rather than in the element name. Industrial search engines and database search facilities can index XML elements and run standard SQL queries based on attribute values.

Michael Margolin, DEEDS Project, University of Toronto
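The retrieval point in 3 b) can be illustrated with a minimal sketch. It assumes a hypothetical invocatio encoded twice, once with a generic <seg type="..."> and once with dedicated element names; the sample data and the function names are illustrative and do not come from any project in this thread. It uses Python's standard library rather than SQL.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same invocatio encoded in two ways.
GENERIC = """
<tenor>
  <seg type="protocol">
    <seg type="invocatio">In nomine sanctae et individuae trinitatis</seg>
  </seg>
</tenor>
"""

DEDICATED = """
<tenor>
  <protocol>
    <invocatio>In nomine sanctae et individuae trinitatis</invocatio>
  </protocol>
</tenor>
"""


def find_part_generic(xml_text, part):
    """Return the text of every <seg> whose type attribute equals `part`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("seg") if el.get("type") == part]


def find_part_dedicated(xml_text, part):
    """Return the text of every element whose tag name equals `part`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(part)]


if __name__ == "__main__":
    print(find_part_generic(GENERIC, "invocatio"))
    print(find_part_dedicated(DEDICATED, "invocatio"))
```

Both lookups return the same text here, so on this small sample the choice between attribute and element name does not change what can be retrieved. The sketch says nothing about indexing cost on a large corpus.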

Gautier Poupeau

November 2004

6:28 p.m.

New subject: <div> or <seg> or ...

m.margolin@utoronto.ca wrote:

...

Hi everybody,

1. <div> vs. <seg> My suggestion to use <seg> for encoding diplomatic parts is based solely on the TEI definition, which states that <seg> marks a text fragment, and that is exactly what a diplomatic part is. <div>, on the other hand, is more generic by definition and may contain some metadata along with the text. I think we should always seek a balance between the specific and the generic. The price of being too generic would be a performance penalty on any kind of information retrieval. On the other hand, any attempt to enumerate the content of the data (for example, by using precisely named elements inside <tenor>) can make the encoding too restrictive and essentially inapplicable. Therefore I suggest using <div type="document"> instead of <document>, keeping <tenor> because of its unambiguous meaning, and using <seg> to encode any diplomatic part at any level.

The <seg> element isn't suitable, because it can't contain the <p> element, and a diplomatic part may have several paragraphs. Instead of <div type="document">, I prefer the <text> element, as I explained in my preceding mail. The advantage of this element is that you can have one <text> element for each charter. If you have a single document, you have this structure:

<tei.2>
  <teiHeader> </teiHeader>
  <text>
    <front> Metadata and analysis of the charter </front>
    <body> List of witnesses, bibliography and the edited text of the charter </body>
    <back> Appended documents </back>
  </text>
</tei.2>

If you have a cartulary, you can use the <group> element instead of <body>, so you have:

<tei.2>
  <teiHeader> </teiHeader>
  <text>
    <front> Introduction and presentation of the cartulary </front>
    <group>
      <text> ... the same structure as for one charter ... </text>
      ... as many <text> elements as there are charters ...
    </group>
    <back> Appended documents for the cartulary </back>
  </text>
</tei.2>

Gautier
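Gautier's objection to <seg> can be checked mechanically on existing encodings. The following is a minimal sketch, assuming the content model as he describes it (no <p> inside <seg>); the sample fragment and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one diplomatic part with two paragraphs, one without.
SAMPLE = """
<tenor>
  <seg type="dispositio"><p>First paragraph.</p><p>Second paragraph.</p></seg>
  <seg type="invocatio">In nomine Dei</seg>
</tenor>
"""


def segs_with_paragraphs(xml_text):
    """List the type of every <seg> that contains a <p> at any depth."""
    root = ET.fromstring(xml_text)
    return [
        seg.get("type")
        for seg in root.iter("seg")
        if seg.find(".//p") is not None
    ]


if __name__ == "__main__":
    print(segs_with_paragraphs(SAMPLE))
```

Any part reported by this check is one that would need a block-level container such as <div> under that content model.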

Georg Vogeler

5:15 p.m.

New subject: <div> or <seg> or ...

Salut Gautier, hello to everybody,

the more I think about your concept the more I like it. It seems to fit perfectly with the TEI concept to take the single charter as a single text, which may or may not be part of a corpus (<group>). Therefore I would suggest that the TEI allow a type attribute on the <text> element.

I'm less happy with your suggestion to encode the metadata as a <front> element. It is true that one can take the regestum as front matter to the charter in a scholarly edition. But works like the Regesta imperii would then contain only front matter, plus some front matter to the collection of front matter. I think the metadata is more like the <msDescription>: it contains text representing real-world information on the charter or, to put it pointedly, representing the charter itself, as many historians would take the abstract for the charter itself.

So I would follow Michael Margolin's suggestion to distinguish between metadata (<regestum>) and text (<tenor>) with dedicated elements:

<text type="document">
  <body>
    <regestum> abstract, witness list, <listBibl type="...">, diplomatic analysis ... </regestum>
    <tenor>
      <div type="protocol">
        <div type="Invocatio">In nomine sanctae et individuae trinitatis</div>
        <div type="Intitulatio">Heinricus <cl type="clausula devotionis">divina favente clementia</cl> romanorum imperator</div>
        ...
      </div>
      ...
    </tenor>
  </body>
</text>

But I'm willing to change my mind :-)

Best wishes
Georg

On 9 Nov 2004 at 19:28, Gautier Poupeau wrote:

...


_________________________________________________________ Historisches Seminar Abteilung Geschichtliche Hilfswissenschaften Ludwig-Maximilians-Universität Muenchen Postadresse: Geschwister-Scholl-Platz 1, D-80539 Muenchen Bueroadresse: Amalienstr. 52, Zi. 211 T: ++49-89-2180 3784 F: ++49-89-21 80 2084 e-mail: G.Vogeler@lmu.de http://www.geschichte.uni-muenchen.de/ghw/personen_vogeler.shtml Moderator von der Virtual Library Geschichtliche Hilfswissenchaften (http://www.vl-ghw.lmu.de)
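The nesting in Georg's <text type="document"> example, and Michael's earlier remark that the depth of subdivision is hard to predict, can be made concrete with a minimal sketch that walks the <tenor> recursively. The element and attribute names follow Georg's example; the function name and the (depth, type) output shape are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Georg's example, with the elided parts left out.
CHARTER = """
<text type="document">
  <body>
    <regestum>abstract, witness list, bibliography, diplomatic analysis</regestum>
    <tenor>
      <div type="protocol">
        <div type="Invocatio">In nomine sanctae et individuae trinitatis</div>
        <div type="Intitulatio">Heinricus <cl type="clausula devotionis">divina favente clementia</cl> romanorum imperator</div>
      </div>
    </tenor>
  </body>
</text>
"""


def outline(element, depth=0):
    """Return (depth, type) pairs for nested <div> and <cl> parts, in document order."""
    rows = []
    for child in element:
        if child.tag in ("div", "cl"):
            rows.append((depth, child.get("type")))
            rows.extend(outline(child, depth + 1))
    return rows


if __name__ == "__main__":
    tenor = ET.fromstring(CHARTER).find("body/tenor")
    for depth, part in outline(tenor):
        print("  " * depth + part)
```

Because the walk is recursive, it does not need to know in advance how many levels of diplomatic subdivision a given group of charters uses.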

Patrick Sahle

9:58 p.m.

New subject: Catching up with the discussion

Dear all,

I'm very sorry to have been absent from this discussion so far. I'll try to catch up a little ...

1) Structure inside <tenor> (or whatever we will call it in the end). After all the discussion I still don't like <seg> or <div> here. Michael and Gautier have both raised good arguments against the respective elements. We should keep in mind that in our work we try to mark up a very specific view of certain texts, which is represented by an established vocabulary. <div> and <seg>, although extremely generic, are meant for "primary" views of texts such as global structures, layout structures and the like. In our approach I see parallels to other analytical approaches to certain types of texts (like drama, poems, spoken text etc.) or analytical perspectives (like grammar or semantics). Just as those have created dedicated tags, we should try to establish things like <invocatio> or <protocol>. There is another, more generic, reason for my proposal: I expect these elements to be further differentiated in taxonomies of, for example, arengas or eschatocols. Then we would have things like <div type="arenga" subtype="xyz">, and although the concept of "subtype" already exists, it is one of the ugliest I know and contradicts the whole concept of hierarchical markup, since it is (for example) not nested ... So: I still strongly plead for the whole diplomatic formular (?) at the element level!

2) Charter numbers only as attributes: No. Please keep in mind that my specific work mainly concerns the digitisation of already printed (edited) charters as collections in charter books. There are numerous systems and ways of naming or numbering these charters. I have to be able to keep and represent these systems. And I need the n attribute to establish another, internal, global, synthetic numbering system. So I would still need a special element for these things - but I don't know what to call it. Any suggestions?

3.) Elongata as <hi type="elongata">. D'accord.

4.) Facsimilia, prints, regesta, studies etc. as <listBibl>. Yes, that's what I proposed earlier. But we should press the TEI to allow a type attribute, for which we should provide a taxonomy. To use "n=printedEditions" would be a clear misuse of the semantics of the TEI system, Gautier. Maybe we can even find allies for this case: there is work going on in the manuscript description (formerly known as MASTER) section of the TEI, and I'm pretty sure that there are similar phenomena in manuscript catalogues.

5.) <document> - After your discussion I have changed my mind and now think that <group> <text type="charter"> would indeed be a good solution!

6.) I'm somewhat unhappy with <regestum>. As far as I have understood, this would contain ALL the information on a given charter except the text? Then we would have an extremely wide notion of "regestum", since this would include all bibliographic references, description of physical phenomena, discussion of authenticity, discussion of the content of a charter and so on. Is this really the common use of the word "regestum"? I have some doubts. Maybe we can find a more generic word, and then restrict <regestum> to what it really means: the short summary of the main information of a given charter. For this, the word has a perfectly clear and specific meaning and should be an element rather than an attribute of another already established element.

7.) There are a lot of other points I would like to comment on. But I think we should first try to boil down some of the open questions that have been raised so far ...

8.) Dear Michael Margolin:

...

2. Cartulary and Document. In my understanding, the subject of the XML encoding is a medieval charter, which might belong to one or more cartularies. From the implementation point of view it would be unwise to encode multiple charters (a cartulary) in one text file in which some special elements (like <div>) mark the boundaries of each charter. Therefore the encoding of a given charter should include only references to its parent cartularies. The common approach to implementing a repository of charters is to create an independent database (or file system) entry for each charter.

** "To create an independent database (or file system)" entry is not THE common approach. It is ONE out of many approaches. As far as I can see, there are many more projects, dealing with much more material, that start by digitising already printed charter editions and have two goals: to keep all the information of the printed version and to make it available for systematic computer-aided research. And - well - then we have the approach Gautier stands for: creating new charter editions from scratch, but seeing things like a cartulary as the primary information unit and not the "abstract" charter, which indeed can have several witnesses, documents and versions. I believe that the strength of our group is that we perfectly represent these different approaches. But we should try to solve the problems of ALL of us. Of course this raises elementary problems - coming back to your posting: "the subject of XML encoding is a medieval charter". Yes, sometimes. And sometimes it's a charter book (cartulary), sometimes it's a single charter in an archive, and sometimes (maybe most often) it's a printed collection of charters ...
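The two views need not exclude each other: a collection encoded as <group> can be split into per-charter records that each carry a reference back to the collection. The following is a minimal sketch, assuming the <group>/<text type="charter"> structure discussed in this thread; the source attribute, the identifier "cartulary-A" and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection file: one <text> per charter inside <group>.
CARTULARY = """
<text>
  <front>Introduction to the cartulary</front>
  <group>
    <text type="charter" n="1"><body>First charter</body></text>
    <text type="charter" n="2"><body>Second charter</body></text>
  </group>
</text>
"""


def split_charters(collection_xml, source_id):
    """Return {charter number: standalone XML string}, one record per charter."""
    root = ET.fromstring(collection_xml)
    records = {}
    for charter in root.findall("group/text"):
        # Hypothetical back-reference to the parent collection.
        charter.set("source", source_id)
        xml_string = ET.tostring(charter, encoding="unicode").strip()
        records[charter.get("n")] = xml_string
    return records


if __name__ == "__main__":
    for number, record in split_charters(CARTULARY, "cartulary-A").items():
        print(number, record)
```

Each record could then be stored as its own database or file system entry, while the collection file remains the source that preserves the printed arrangement.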

Best regards, Patrick Sahle ___________________________________________________________________ Universität zu Köln Historisches Seminar Albertus-Magnus-Platz 50923 Koeln Privat: Häuschensweg 2a 50827 Köln +49 - (0)221 - 2805695 Sahle@uni-koeln.de http://www.uni-koeln.de/~ahz26/

Georg Vogeler

3:18 p.m.

New subject: <regestum>

Hello from Munich,

...

6.) I'm somewhat unhappy with <regestum>. As far as I have understood, this would contain ALL the information on a given charter except the text? Then we would have an extremely wide notion of "regestum", since this would include all bibliographic references, description of physical phenomena, discussion of authenticity, discussion of the content of a charter and so on. Is this really the common use of the word "regestum"? I have some doubts. Maybe we can find a more generic word, and then restrict <regestum> to what it really means: the short summary of the main information of a given charter. For this, the word has a perfectly clear and specific meaning and should be an element rather than an attribute of another already established element.

We have two elements in our list which could easily be confused in German, as both are called "Regest": one is the "Kopfregest" heading a scholarly edition, and the other is the "Vollregest" of the kind collected for the Regesta imperii, giving the content of the document, an extensive description and some historical and diplomatic analysis.

If we take the "Kopfregest" together with the parts that follow it (textual tradition, bibliographic information, diplomatic analysis) and compare that kind of text with the "Vollregest" of the Regesta imperii, then I find them very similar. Thus we could use the term "<regestum>" for the "Vollregest" (including information on textual tradition etc.) and "<abstract>" for the "Kopfregest" (or "summary", or "analyse" in French). And finally, I think the archival description is similar to the "Vollregest" and is also called "Regest". But Karsten Uhde can correct me.

Thus I would stick to

<regestum>
  <abstract> ... </abstract>
  <witList> ... </witList>
  ...
</regestum>
<tenor> ... </tenor>

Best wishes
Georg

Patrick Sahle

11:07 a.m.

New subject: <regestum>

If everybody sees <regestum> as a common and easily understandable term for what we are talking about (all the header information of an edited charter), then there is no more resistance on this point from my side.

patrick

At 16:18 12.11.04 +0100, Georg Vogeler wrote:

...


Gautier Poupeau

1:19 p.m.

New subject: <regestum>

Patrick Sahle wrote:

...

If everybody sees <regestum> as a common and easily understandable term for what we are talking about (all the header-information of an edited charter), then there is no more resistance to this point from my side.

I'm like you, Patrick: I don't like this <regestum> element inside the <body> element for holding all the header information of an edited charter. I understand Georg's argument regarding the Regesta imperii project. But for me, our goal is to define the structure of a charter and not just of a regestum. Editing a charter and editing a regestum are different things: they are two different types of document, and so two different structures in XML. Moreover, I understand the difference between Vollregest and Kopfregest, but if we read the CID definition, a regestum is a short sentence that summarizes and analyzes the charter, not all the header information of an edited charter. It is just one part of this information. If I understand correctly, the <regestum> element gathers all the header information and <tenor> holds the edited text. But if we read the TEI guidelines, <regestum> corresponds to the <front> element and <tenor> to the <body> element. Still, like Patrick, if everybody is OK with the <regestum> element, I won't object.

Gautier

Gautier Poupeau

4:07 p.m.

New subject: charters numbers only as attributes ?

...

2) Charter numbers only as attributes: No. Please keep in mind that my specific work mainly concerns the digitisation of already printed (edited) charters as collections in charter books. There are numerous systems and ways of naming or numbering these charters. I have to be able to keep and represent these systems. And I need the n attribute to establish another, internal, global, synthetic numbering system. So I would still need a special element for these things - but I don't know what to call it. Any suggestions?

I understand your position, Patrick. Indeed, in this case the number of the charter in the book must be enclosed in an element, because this information is in the book. At the Ecole des chartes we have a project to digitise editions of cartularies made at the beginning of the 19th century. And I must confess that for the number I use <div type="numero"> ;-) It's not very good, but it's the only solution we found. We consider this information to be an integral part of the structure of the text.

Incidentally, I will take this occasion to explain our project quickly, because I think it could be interesting. We digitise the edition and the index in TEI XML, and the whole book, including the introduction, as images. We use the METS schema (http://www.loc.gov/standards/mets/) to reconstruct the physical structure of the book in an XML file, so that for any page we can say which files or parts of files (JPEG, TIFF, XML) correspond to it. Using METS allows us to concentrate on the semantic structure in the TEI files (we use the <pb/> element to indicate the page in the TEI files) and on the physical structure in the METS files.

Best wishes
Gautier Poupeau
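Patrick's requirement (keep the printed numbering and add an internal one) and Gautier's <div type="numero"> workaround can be combined into a simple concordance. The following is a minimal sketch under stated assumptions: the internal identifiers in n, the Roman numerals and the function name are invented for illustration, and the METS side is not modelled at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical digitised edition: printed number as content, internal number in @n.
EDITION = """
<group>
  <text n="c0001"><body>
    <pb n="12"/>
    <div type="numero">XLVII</div>
    <div type="tenor">Charter text ...</div>
  </body></text>
  <text n="c0002"><body>
    <div type="numero">XLVIII</div>
    <pb n="13"/>
    <div type="tenor">Charter text ...</div>
  </body></text>
</group>
"""


def concordance(edition_xml):
    """Map each internal number (@n) to the printed charter number, if present."""
    root = ET.fromstring(edition_xml)
    table = {}
    for charter in root.iter("text"):
        printed = [
            div.text
            for div in charter.iter("div")
            if div.get("type") == "numero"
        ]
        table[charter.get("n")] = printed[0] if printed else None
    return table


if __name__ == "__main__":
    print(concordance(EDITION))
```

The printed number stays in the text as transcribed content, while n carries the synthetic numbering, so neither system overwrites the other.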

Georg Vogeler

4:52 p.m.

New subject: charters numbers only as attributes ?

Hi everybody,

Gautier's and Patrick's contributions bring me back to the suggestion of a dedicated element <num> in the CEI proposal for the TEI. It might be good for the TEI to have that element for other projects too, as digitising existing prints seems to me a common enterprise. And, Gautier, your project again looks very interesting!

Best wishes
Georg

On 12 Nov 2004 at 17:07, Gautier Poupeau wrote:

...


Patrick Sahle

11:08 a.m.

New subject: charters numbers only as attributes ?

At 17:07 12.11.04 +0100, you wrote:

...


** This sounds like just another very fine, very advanced and very sophisticated project. I'm impressed and would like to see more of your usage of METS. As regards the usage of <div type="numero">: of course I don't like this but can understand your reasons to use it anyway ... patrick


Gautier Poupeau

6:45 p.m.

New subject: <div> or <seg> or ...

Indeed, it will be hard to normalize this. But it is important to normalize a little, and the CID terminology can be a good starting point. If we normalize, we could run, for example, this type of query: find the phrase "in nomine Dei" in the protocol. If we don't normalize the content of our attributes, we couldn't run this query.

Gautier
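The query Gautier describes can be sketched as follows. This is a minimal illustration, assuming diplomatic parts are encoded as <div type="...">; the two sample charters, the non-normalized value "Protokoll" and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: charter 1 uses the agreed term, charter 2 a local variant.
CHARTERS = """
<group>
  <text n="1"><body><tenor>
    <div type="protocol"><div type="invocatio">In nomine Dei</div></div>
    <div type="context">... in nomine Dei ...</div>
  </tenor></body></text>
  <text n="2"><body><tenor>
    <div type="Protokoll"><div type="invocatio">In nomine Dei</div></div>
  </tenor></body></text>
</group>
"""


def charters_with_phrase_in(root, part, phrase):
    """Return @n of every charter whose <div type=part> contains the phrase."""
    hits = []
    for charter in root.iter("text"):
        for div in charter.iter("div"):
            content = "".join(div.itertext()).lower()
            if div.get("type") == part and phrase.lower() in content:
                hits.append(charter.get("n"))
                break  # one hit per charter is enough
    return hits


if __name__ == "__main__":
    root = ET.fromstring(CHARTERS)
    print(charters_with_phrase_in(root, "protocol", "in nomine Dei"))
```

On this sample the query finds only charter 1: charter 2 contains the same phrase in the same diplomatic part, but its attribute value is not normalized, so it is missed.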

...

