\documentclass{mlcennote}
\title{MetaLex CEN Workshop Proposal: Naming Convention for Interoperability}
\author{Fabio Vitali}
\institute{Dept of CS \& Law\\University of Bologna\\Italy}
\author{Erik Hupkes (editor)}
\institute{\defaultaffiliation}
\runningauthor{Fabio Vitali}
\correspondingauthor{Fabio Vitali}
\email{fabio@cs.unibo.it}
\Leibnizreportdate{November 2007}
\begin{document}
\maketitle

\section{Naming Convention for Interoperability}

\subsection{Scope}
The purpose of this naming convention is not to fix a single way to construct URIs, but to define a minimal data set in the metadata (meta Unique$\_$ResourceIdentificator, or \textit{mURI}). This \textit{mURI} can be used as the actual URI after an XSLT/CSS transformation or through a resolution mechanism (software), and may be managed by the author of the legal information resource or by the editor.

The naming convention should respect the following principles and characteristics:
\begin{enumerate}
\item the name is a meaningful, logical description of the resource and not of its physical path;
\item it shall be permanent and stable over time;
\item it shall derive from the invariant properties of the resource, so as to provide some degree of certainty of obtaining the same name for the same resource regardless of process, tool and person.
\end{enumerate}

\textit{mURIs} are used in numerous situations. In each case it is important to use the \textit{mURI} for the correct level of the document. We introduce here a few particularly frequent situations:
\begin{itemize}
\item Legislative references will most probably refer to a work, because the references must point to something independent of all possible expressions of the work.
\item The list of attachments and schedules belongs to a specific expression, so references to ExpressionComponents are specific to the expression level.
\item Yet the specific Manifestation that is the MetaLex/CEN XML format uses an XML-based syntax to refer to ExpressionComponents and to associate them with the corresponding ManifestationComponents containing the appropriate content. Therefore, within XML files the URI of the ManifestationComponents must be used to refer to attachments and schedules. When referring to the main document, the referring URI must contain the string ``main''.
\item Multimedia fragments within an XML manifestation (e.g., a drawing, a schema, a map, etc.) do not exist as independent ExpressionComponents, as they are only a part of an ExpressionComponent (even when they are the only part). They are only ManifestationComponents, and are therefore referred to in \texttt{object} and \texttt{img} elements with the appropriate ManifestationComponent URI. If a multimedia fragment is referred to by two different ManifestationComponents, whether of the same Manifestation or of a different Manifestation, Expression or Work, the fragment is duplicated.
\end{itemize}

\subsection{Absolute and relative \textit{mURI}}
A \textit{mURI} can be absolute or relative. The absolute form of a \textit{mURI} is a complete set of metadata that gives the complete path to a specific resource. The relative form of the same \textit{mURI} is a partial set of metadata that can identify the manifestation only with respect to a certain context. Relative forms are useful because several \textit{mURIs} can be completed from the same context. In particular, we can observe two possible uses of relative \textit{mURIs}:
\begin{itemize}
\item References at the \textit{work}, \textit{expression} and \textit{manifestation} levels need to be specified as relative \textit{mURIs} grounded on the top level.
\item References at the \textit{expression} and \textit{manifestation} levels need to be specified as relative \textit{mURIs} grounded on the work level.
\end{itemize}
In XML manifestations of MetaLex/CEN documents, \textit{mURIs} shall always be expressed in relative form, grounded at the root level of the URI.

\subsection{The \textit{mURI} for the Work}
The \textit{mURI} for the \textit{work} is the baseline for building the \textit{mURI} for the \textit{expression}, which is in turn the baseline for the \textit{mURI} of the \textit{manifestation}. The \textit{mURI} for the \textit{work} consists of the following pieces:
\begin{itemize}
\item Country (a two- or three-letter code according to ISO 3166-1)
\item Type of document
\item Date (expressed in YYYY-MM-DD format; for document types where the year alone is enough for unique identification, the syntax is YYYY)
\item Number (when appropriate)
\end{itemize}
All components are separated by forward slashes (``/'') so as to exploit relative URIs in references. The country is repeated in the \textit{mURI} so that the detail fragment is independent of the domain name, which allows both country-specific and international resolution engines.

\subsection{The \textit{mURI} of an Expression}
What characterizes an Expression is the specific identification of some content with respect to other content. This includes the specification of the version and the language of the expression. Therefore, different versions of the same work, or the same version of the same work expressed in different languages, correspond to different Expressions and will have different \textit{mURIs}. Expressions are organized in components (the ExpressionComponents), and therefore we need to identify separately the Expression as a whole and each individual ExpressionComponent, each with its own \textit{mURI}. All of them are derived directly from the baseline, which is the \textit{mURI} for the \textit{work}.
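
As a minimal sketch of the work-level convention on which these derivations rest, the following Python function assembles a work \textit{mURI} from its four pieces. The function name, its parameter names and the lower-casing of the country code are assumptions made for illustration only; they are not part of the MetaLex/CEN proposal.

```python
from datetime import date
from typing import Optional, Union


def build_work_muri(country: str, doc_type: str,
                    when: Union[date, int],
                    number: Optional[str] = None) -> str:
    """Assemble a work-level mURI from its pieces (hypothetical helper)."""
    # Country: a two- or three-letter ISO 3166-1 code. Lower-casing it
    # mirrors the examples in this note and is an assumption.
    if not (country.isalpha() and len(country) in (2, 3)):
        raise ValueError("country must be a two- or three-letter code")
    # Date: a full date yields YYYY-MM-DD, a bare year yields YYYY.
    date_part = when.isoformat() if isinstance(when, date) else f"{when:04d}"
    pieces = [country.lower(), doc_type, date_part]
    # Number: only present "when appropriate".
    if number is not None:
        pieces.append(str(number))
    # Leading slash: relative form grounded at the root level.
    return "/" + "/".join(pieces)


print(build_work_muri("nl", "act", date(2004, 2, 13), "2"))
# prints /nl/act/2004-02-13/2
```

Keeping the pieces in a list and joining them with ``/'' means that the expression-level parts described next can be appended to the returned string without re-parsing it.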
\subsubsection{The \textit{mURI} for the expression as a whole}
The \textit{mURI} for the \textit{expression} as a whole consists of the following pieces:
\begin{itemize}
\item The URI of the corresponding \textit{work}
\item The character ``/''
\item The human language in which the expression is drafted
\item An optional version identifier, composed of the ``@'' character followed by:
\begin{itemize}
\item If an approved act, the version date of the expression in the syntax YYYY-MM-DD.
\item If a bill, the presentation date where appropriate, or the stage in the approval process of which the current draft is the result.
\item If the ``@'' character is not followed by a date, the identifier points to the original document.
\end{itemize}
\end{itemize}
The absence of the version identifier signals two different situations, depending on the type of document:
\begin{itemize}
\item If the document is not versioned (e.g., the minutes of an assembly), then no version identifier is needed, nor can one be present.
\item If the document is versioned (e.g., an act in force), then the lack of a version identifier refers to the version in force at the moment of the resolution of the URI (i.e., the ``current'' version of the act, where ``current'' refers to the moment in time in which the \textit{mURI} is dereferenced, rather than the moment in time in which the document containing the URI was created).
\end{itemize}
\begin{table}[htbp]
\centering
\begin{tabular}{@{} |l|l| @{}}
\hline
/fr/minutes/2004-12-21/fr & French parliamentary debate record, 21st \\
 & December 2004, French version\\
\hline
/nl/act/2004-02-13/2/en & Dutch enacted Legislation. Act number\\
 & 2 of 2004. English version, current \\
 & version (as accessed today) \\
\hline
/it/act/2004-02-13/2/it@ & Italian enacted Legislation. Act \\
 & number 2 of 2004. Italian version,\\
 & original version\\
\hline
/hu/act/2004-02-13/2/hu@2004-07-21 & Hungarian enacted Legislation. Act \\
 & number 2 of 2004.
Hungarian version, \\
 & as amended on 21st July 2004\\
\hline
\end{tabular}
\end{table}

\subsubsection{The \textit{mURIs} for ExpressionComponents}
Some expressions have many components, while others are composed only of a main document. In order to refer explicitly to individual components, it is therefore necessary to introduce a naming convention that identifies individual components and still allows an easy connection between a component and the expression it belongs to. \\
There are therefore two subcases.\\
\textbf{The expression is composed of only one component}\\
In this case, the \textit{mURI} for the expression as a whole and for its main component are identical. \\
\textbf{The expression is composed of many components}\\
In this case, the \textit{mURI} for each ExpressionComponent consists of the following pieces:
\begin{itemize}
\item The URI of the corresponding \textit{expression} as a whole
\item The character ``/''
\item Either
\begin{itemize}
\item A unique name for the attachment, or
\item The name ``main'', which is reserved for the main document. If there are several main documents, they should be numbered in sequence: main1, main2, etc.
\end{itemize}
\end{itemize}

\subsubsection{Hierarchies of components in ExpressionComponents}
Frequently, an attachment has further attachments of its own. This creates a hierarchy in which the component should itself be treated, in a way, as an expression, whose components should be listed as well and properly differentiated. The process can be iterated further, when the attachments of an attachment have further attachments in turn, and so on. The convention must also cover the case in which attachments at different levels of the hierarchy end up having the same name (e.g., table A in schedule 1 and table A in schedule 2).
In such cases, each ExpressionComponent must be considered as an expression by itself. Recursively, the \textit{mURIs} of attachments are formed as follows:
\begin{itemize}
\item If the attachment does not have further attachments, its \textit{mURI} is formed as detailed in the previous section, without further additions.
\item If the attachment has further attachments, the URI as detailed in the previous section refers to the whole attachment, including its own attachments.
\item To refer to the main document of an attachment that has further attachments, a further ``/main'' part should be added.
\item To refer to any further attachment of an attachment, a further ``/'' followed by a unique name for the further attachment must be added to the URI of the attachment itself.
\end{itemize}
\begin{table}[htbp]
\centering
\begin{tabular}{@{} |l|l| @{}}
\hline
{\small /fr/minutes/2004-12-21/fr/main} & French parliamentary debate \\
 & record, 21st December 2004, \\
 & French version, main document\\
\hline
{\small /nl/act/2004-02-13/2/en/main/annex1} & Dutch enacted Legislation. \\
 & Act number 2 of 2004. English \\
 & version, current version (as \\
 & accessed today), annex1 to the \\
 & main document\\
\hline
{\small /it/act/2004-02-13/2/it@/main/annex1/table3} & Italian enacted Legislation. \\
 & Act number 2 of 2004. Italian \\
 & version, original version, table3 \\
 & of annex1 of the main\\
 & document\\
\hline
{\small /hu/act/2004-02-13/2/hu@2004-07-21/main3/map4} & Hungarian enacted Legislation. \\
 & Act number 2 of 2004.
\\
 & Hungarian version, as amended \\
 & on 21st July 2004, map 4 of the main \\
 & document number 3\\
\hline
\end{tabular}
\end{table}

\subsubsection{The URI of virtual expressions}
In some situations the actual entry-into-force date of the expression is not known in advance, and it is necessary to create references to, or mentions of, documents whose URI is not known completely (possibly because their exact delivery date is not yet known). These are called virtual expressions (i.e., references to expressions that probably do not exist yet, or may never exist, but that can be unambiguously deduced once all relevant information is made available). We must distinguish three cases in such situations:
\begin{enumerate}
\item the information is not known by the author of the expression (e.g., the legislator), in which case the act of actually retrieving the correct information is an act of interpretation;
\item the information is not known by the editor of the expression (e.g., the publisher of the XML version of the document), in which case the information may in theory be available, but retrieving it is too much of a burden for the publisher;
\item the information is not known by the query system.
\end{enumerate}
In all these cases, the URI of the virtual expression uses a syntax similar to that of the actual expression, but the character ``:'' is used instead of the ``@'' after the specification of the work URI. For instance, if we need to reference the expression of an act in force on the date ``1/1/2007'', we will probably need to refer to an expression whose entry-into-force date precedes 1/1/2007.
\begin{table}[htbp]
\centering
\begin{tabular}{@{} |l|l| @{}}
\hline
/at/act/2004-02-13/2/de:2004-07-21 & Austrian enacted Legislation. Act number \\
 & 2 of 2004.
German version, as amended on\\
 & the closest date before July 21, 2004\\
\hline
\end{tabular}
\end{table}

\subsection{The \textit{mURI} of the Manifestation}
What characterizes the Manifestation is the specific process that generated an electronic document in some specific format(s). This includes the specification of the data format. Therefore, electronic documents generated from the same expression using different data formats correspond to different Manifestations and will have different \textit{mURIs}. Manifestations are organized in components (the ManifestationComponents), and therefore we need to identify separately the Manifestation as a whole and each individual ManifestationComponent, each with its own \textit{mURI}. All of them are derived directly from the baseline, which is the \textit{mURI} for the \textit{expression}.

\subsubsection{The manifestation as a whole}
The \textit{mURI} for the Manifestation as a whole consists of the following pieces:
\begin{itemize}
\item The \textit{mURI} of the corresponding \textit{expression} as a whole.
\item The character ``.''
\item A unique three-letter acronym of the data format in which the manifestation is drafted. The acronym can be ``pdf'' for PDF, ``doc'' for MS Word, or ``xml'' for the XML manifestation. The acronym ``pck'' denotes the package of all documents, including the XML version of the main document(s), according to the MetaLex/CEN rules.
\end{itemize}
\begin{table}[htbp]
\centering
\begin{tabular}{@{} |l|l| @{}}
\hline
/fr/minutes/2004-12-21/fr.doc & Word version of the French \\
 & parliamentary debate record, 21st \\
 & December 2004, French version\\
\hline
/en/act/2004-02-13/2/en.pdf & PDF version of English enacted \\
 & Legislation. Act number 2 of 2004. \\
 & English version, current version (as \\
 & accessed today)\\
\hline
/it/act/2004-02-13/2/it@2004-07-21.pck & Package of all documents including \\
 & XML versions of the Italian enacted \\
 & Legislation.
Act number 2 of 2004. \\
 & Italian version, as amended in July\\
 & 2004\\
\hline
\end{tabular}
\end{table}

\subsubsection{The URIs for the ManifestationComponents}
Each ManifestationComponent is an independent electronic structure (e.g., a file) in a single data format. Every type of manifestation naturally has a different data structure and file structure. Therefore the actual format of the \textit{mURIs} of the components of the manifestation depends on the data format and cannot be formalized in general. In this section we therefore provide a grammar but not an exhaustive list of formats, since that list depends on the data format chosen for the manifestation. The \textit{mURI} for each ManifestationComponent consists of the following pieces:
\begin{itemize}
\item The URI of the corresponding expression as a whole.
\item The character ``/''.
\item Some unique identification of the ManifestationComponent with respect either to the manifestation as a whole or to the ExpressionComponent of which the component is the manifestation.
\item The character ``.''
\item A unique extension for the data format in which the manifestation is drafted. The extension can be ``pdf'' for PDF, ``doc'' for MS Word, ``xml'' for XML documents, ``tif'' for image formats, etc.
\end{itemize}
In the next section we will examine the format of the package and the relevant URIs for a specific manifestation of MetaLex/CEN documents, the XML format.

\subsubsection{The URI for the components in the MetaLex/CEN package manifestation}
The MetaLex/CEN XML manifestation is a very specific manifestation that uses a number of data formats (mainly XML, but possibly also multimedia formats as needed) with a very specific organization of parts and components. Since it makes explicit choices in terms of data formats and reciprocal references, it is important to provide clear and unambiguous rules for the internal naming mechanism and its overall structure.

A MetaLex/CEN XML manifestation is a package composed of one or more files organized in a flat fashion. The transportable format is a ZIP file whose extension is ``.pck''. Other formats are possible and acceptable as long as they adhere to these rules. The following are the alternative options for the MetaLex/CEN package:
\begin{enumerate}
\item If the document is composed only of text and does not refer to any multimedia fragment of any form, then the ZIP package contains a single document called ``main.xml''.
\item If the document is composed of many ManifestationComponents but does not refer to any multimedia fragment of any form, then the ZIP package is composed of many XML files, \textbf{one for each ExpressionComponent}. Each ManifestationComponent is then named after its corresponding ExpressionComponent, plus the ``.xml'' extension. The name ``main'' is reserved for the main component. Numbers are never used.
\item If the document contains multimedia fragments of any kind, then each individual fragment does not have a corresponding ExpressionComponent, but is just a ManifestationComponent referred to in an \texttt{img} or \texttt{object} element. All multimedia components must be stored within an inner structure (e.g., a folder) called ``media''. Multimedia components can be named freely, but must use the appropriate extension for their content type. Thus a logo can be called ``logo.tif'' or any other name, as long as the extension correctly specifies the content type.
\end{enumerate}
Reciprocal references to ManifestationComponents are necessary within a specific manifestation. For instance, the manifestation of the main document refers to the manifestations of its attachments via the \texttt{attachment} elements, and the schedule showing an image refers to the file of the image via the \texttt{img} element. In these cases, all references MUST be relative to the package (i.e., the manifestation as a whole).
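
The package naming rules above can be sketched as follows. This is only an illustration under stated assumptions: the function names \texttt{component\_file}, \texttt{media\_file} and \texttt{package\_listing} are hypothetical, and the check that a multimedia name carries an extension is one possible reading of the requirement that the extension specify the content type.

```python
import posixpath


def component_file(expression_component: str) -> str:
    """Package-relative name of the XML file for an ExpressionComponent."""
    # Each ManifestationComponent is named after its ExpressionComponent,
    # plus the ".xml" extension.
    return f"{expression_component}.xml"


def media_file(name: str) -> str:
    """Package-relative path of a multimedia ManifestationComponent."""
    # Assumption: reject names without an extension, since the extension
    # is what identifies the content type.
    if not posixpath.splitext(name)[1]:
        raise ValueError("media names need an extension for the content type")
    # All multimedia components live in the inner structure "media".
    return posixpath.join("media", name)


def package_listing(components, media=()):
    """List the package-relative names of all files in a package."""
    return [component_file(c) for c in components] + \
           [media_file(m) for m in media]


print(package_listing(["main", "attachment01", "schedule03"], ["logo.tif"]))
# prints ['main.xml', 'attachment01.xml', 'schedule03.xml', 'media/logo.tif']
```

Because every name returned is relative to the package, the same strings can be used unchanged as the targets of \texttt{attachment} and \texttt{img} references inside the XML files.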
\begin{table}[htbp]
\centering
\begin{tabular}{@{} |l|l| @{}}
\hline
attachment01.xml & Manifestation of the first attachment of the current document\\
\hline
schedule03.xml & Manifestation of the third schedule of the current document\\
\hline
media/logo.tif & Manifestation of an image within the current document\\
\hline
\end{tabular}
\end{table}
References to ManifestationComponents are rarely, if ever, needed outside of the manifestation itself. If they are needed, they refer to the file as follows:
\begin{itemize}
\item The \textit{mURI} of the corresponding \textit{expression} as a whole.
\item The character ``/''.
\item The relative reference to the required ManifestationComponent as specified above.
\end{itemize}

\subsection{The URIs for the Item}
MetaLex/CEN makes no assumption about the physical storage mechanism employed to record actual manifestations. Therefore there is no rule for the \textit{mURI} of items, which are free to assume any form whatsoever and to correspond to whatever storage mechanism has been employed locally.

\bibliographystyle{apalike}
\bibliography{biblioD3.2}
\end{document}