Using XML for the exchange of species taxonomy data between research groups
A.I. Ivanov, A.K. Bagachanova, N.K. Potapova
Report at the 2nd Republican Scientific Conference (25-28 November 2003). Abstracts of reports. UDC 517.9;681.3;523.165
Information technologies are actively used in many fields, in particular in biology, to publish information on animals and plants and to exchange information between specialists. We have attempted to compile a database of flies (Insecta, Diptera) covering 300 species of several families.
Today the state of the natural environment has changed so much that it alarms not only biologists and ecologists but also observant people and those who care about nature. This concern has prompted biologists all over the world to create species data banks, which exist today. These cover not only a simple enumeration of species but also the basic biological parameters and the mechanisms of adaptation that allow a species to exist successfully in one natural zone or another.
Influence of abiotic factors on the development of Diptera
The report examines a document model (DOM), following the W3C XML specification, for information on the systematics and classification of animals, the types of their ranges, and their distribution in the Sakha (Yakutia) region.
The Institute of Biology of the Cryolithozone accumulates and constantly supplements information about animals and the places of their distribution. So far it must be noted that both the places where this information is stored and the forms in which it is represented are scattered and, most importantly, that there are no common methods of information retrieval and no established formats for exchanging it.
Machine-to-machine transmission of information between different research groups and data banks is the most pressing problem.
Until now, various taxonomy description languages have been used, the best known being DELTA. However, its use by different groups is currently not recommended because it is outdated, was insufficiently worked out when it was created, and is difficult to integrate with modern information-transmission technologies.
The report goes on to examine the application of XML technologies to support information exchange between research groups.
Survey of existing technologies
Alex Chapman's report to the Taxonomic Databases Working Group states that computer taxonomy essentially dates from the DELTA project of the 1970s, a standard that now needs to be updated. Examples of what can be done can be found in neighboring fields, for example STAR in crystallography.
Later, in Russia, the ZooInt developers attempted to integrate separate relational DBMSs, which resulted in the ZooInt system.
An article by the developers of the Russian ZooInt program states that zoological machine data banks run into specific difficulties caused by the enormous number of animal species (more than a million) and by the extremely branched, multilevel hierarchy of the systematics (more than 40 taxonomic ranks), which, perhaps uniquely in zoology, is constantly changing at both the lower and the higher levels. Many scientific names of species have synonyms, whose number in some categories reaches ten. Furthermore, at any given time there usually exist, in parallel, several alternative systems for each natural group of taxa: several different views on the number of these taxa and on the structure of their hierarchy.
In any case, an implementation on a relational DBMS, or on a DBMS with a non-relational storage model, must not limit the completeness of the information, yet this is exactly what happens because the information is not strictly structured. Data banks should therefore be regarded as information retrieval systems with fuzzy search capability.
The emergence in recent years of a whole class of XML-related software makes it possible to speak of an XML technology that includes full-fledged data manipulation facilities, powerful enough in their own right in comparison with relational database management systems. Moreover, current versions of those systems make extensive use of XML support.
XML technologies are now sufficiently mature: XML documents can be manipulated with DOM, XPath, XSLT, SAX and other languages and technologies.
Ron Gilmour published the first document type definition (XML DTD) for species descriptions.
At present the international Taxonomic Databases Working Group is working on XDELTA, a markup language based on XML.
In the Republic of Sakha there is experience in creating databases of the introduced plants of Yakutia and of the ornamental plants of Yakutia, built on desktop relational DBMSs [Yegorov, Danilov, 2003; Il'in et al., 2003]. This report presents information on the biodiversity of the dipterous insect families (Insecta, Diptera) distributed in Central Yakutia, compiled with XML technologies.
Storage of the information
XML documents can be stored in ordinary file systems, in Internet or LAN repositories, and in databases such as Oracle. XPath can be used to select records; SQL expressions or extensions, or various APIs (LDAP, Z39.50, ISAPI/NSAPI, WebDAV), can also be used for this purpose.
Implementations often use their own binary representation of an XML document to speed up search and reduce the space occupied; the document is restored from this representation on demand. An XML document is more structured than plain text, so a binary (or decomposed) representation gives a known speed advantage over contextual search in full-text databases.
In particular, full-text search requires rebuilding the entire index for all documents, whereas with XML the index can be updated dynamically.
The document is organized so that the systematics of species is as flexible as possible. Taxa are not nested inside each other but are independent elements. This makes it easier to locate taxa when the systematics is revised, and to accommodate different classifications within a given group of taxa.
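The flat organization of taxa can be illustrated with a minimal Python sketch. The element and attribute names (taxa, taxon, id, rank, name, parent) are assumptions made for this illustration and are not taken from the authors' document format; the point is only that each taxon is a sibling element that refers to its parent, so a revised classification changes one reference instead of moving a subtree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are illustrative only.
DOC = """
<taxa>
  <taxon id="t1" rank="order" name="Diptera"/>
  <taxon id="t2" rank="family" name="Muscidae" parent="t1"/>
  <taxon id="t3" rank="family" name="Fanniidae" parent="t1"/>
  <taxon id="t4" rank="species" name="Fannia canicularis" parent="t2"/>
</taxa>
"""

def load_taxa(xml_text):
    """Return {id: attribute dict} for every <taxon> element."""
    root = ET.fromstring(xml_text)
    return {t.get("id"): dict(t.attrib) for t in root.iter("taxon")}

def lineage(taxa, taxon_id):
    """Follow parent references upward; return names from the top rank down."""
    names = []
    while taxon_id is not None:
        taxon = taxa[taxon_id]
        names.append(taxon["name"])
        taxon_id = taxon.get("parent")
    return names[::-1]

taxa = load_taxa(DOC)
print(lineage(taxa, "t4"))

# Adopting another classification changes one reference; no subtree is moved.
taxa["t4"]["parent"] = "t3"
print(lineage(taxa, "t4"))
```

With nested elements the same revision would require cutting the species out of one family element and pasting it into another, which is harder to do consistently across many documents.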
The main window of the program is shown in the figure below:
The controls of the XML document editor
Editing of the documents
The following figures show one of the methods of visually editing an XML document. The program implements a two-way editing mode: in the code window, and with the element tree, the palette of new elements and the attribute editor.
Select the element to be edited. Make changes in the attribute editor if required.
The errors, hints and search results window lists the errors found in the document. Unlike HTML, for example, an XML document must be valid at all times, so corrections have to be made promptly. An added element may, in particular, have a list of required attributes, which is also shown.
To insert a new element, select the position and press the button of the corresponding component. Importantly, the palette shows only those elements that can be used inside the current element. Elements that have already been inserted and cannot be repeated are not shown.
The figure above lists the menu commands that can be executed on one or more documents, as well as the commands that configure the editor for predictive, context-sensitive entry of elements and attributes in the code editor window. One of the editor configuration dialog boxes is shown below.
External repositories of documents
As an example of program configuration, the dialog for adding an LDAP connection to the document tree is shown (the tree is visible on the left, below the recycle bin, under the elements "ftp sites" and "ldap sites").
LDAP and FTP connections are included in the document tree for convenience; usually it is quite possible to use Windows network connections for joint work on documents, for example a WebDAV connection.
Search is supported by building an incomplete inverted index, and the use of XPath in the program is limited to operations needed in rare cases.
The following goals were set when the program was designed:
To meet the last requirement, it was decided not to use query languages such as XPath, but to search roughly the way it is done in various search engines. Species descriptions are well suited to keyword search, and this does not require the user to master a query language.
For the implementation, the most popular method of index organization to date was chosen: the incomplete inverted file algorithm (also called document inversion).
Search in the inverted file is performed by keywords (specifying a precise context is also possible, but in this case it is superfluous). An inverted file can clearly be used for both exact and fuzzy search; the task of searching a collection of documents is then reduced to looking up a word in the dictionary and processing the inverted lists of the terms found.
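The reduction of document search to a dictionary lookup followed by processing of inverted lists can be sketched as follows. This is a minimal Python illustration under simplifying assumptions (whole documents as units, lower-cased word tokens, all query terms required); the function names and sample texts are hypothetical and do not reproduce the program described here.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return ids of documents that contain every term of the query."""
    result = None
    for term in tokenize(query):
        ids = set(index.get(term, ()))      # dictionary lookup
        result = ids if result is None else result & ids  # list intersection
    return sorted(result or ())

docs = {
    1: "Musca domestica, Central Yakutia",
    2: "Syrphidae of Yakutia",
    3: "Musca autumnalis",
}
index = build_inverted_index(docs)
print(search(index, "musca yakutia"))  # [1]
```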
For exact search, tree structures and hash indices are widely used, and grammatical forms and synonyms are handled by query expansion; for substring search, suffix trees can be used.
Search by approximate match effectively leads to a full scan, so it was deliberately not implemented in the program.
The dictionary is expected to be very limited because of the specialized terminology; therefore a sequential scan of the dictionary terms is implemented.
In the future it will be possible to implement:
Furthermore, there is a modification of the n-gram method by Wilbur and Khovayko, which the authors themselves call the method of triads (it uses 3-grams, or "triads"); it was developed to minimize the number of dictionary accesses. The authors proposed building the complete set of terms that share triads with the query keywords and then, using a weighted "similarity coefficient" between the dictionary terms and the query, cutting off unpromising candidates at the stage of reading the inverted lists (those whose similarity coefficient is below a given threshold). Some terms may be missed in this case; this is a compromise between the efficiency of the search and its completeness.
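The idea of collecting terms that share triads with a keyword and discarding those below a similarity threshold can be sketched in Python. The weighting used by Wilbur and Khovayko is not given in this report, so the sketch substitutes the Dice coefficient over trigram sets; that choice, the threshold value and all names are assumptions made for illustration.

```python
from collections import defaultdict

def trigrams(word):
    """Return the set of 3-character substrings ("triads") of a word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_trigram_index(dictionary):
    """Map each triad to the set of dictionary terms that contain it."""
    index = defaultdict(set)
    for term in dictionary:
        for gram in trigrams(term):
            index[gram].add(term)
    return index

def similar_terms(index, keyword, threshold=0.5):
    """Collect terms sharing a triad with the keyword, keep the similar ones."""
    query_grams = trigrams(keyword)
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    scored = []
    for term in candidates:
        term_grams = trigrams(term)
        # Dice coefficient: a stand-in for the authors' similarity weight.
        score = 2 * len(query_grams & term_grams) / (len(query_grams) + len(term_grams))
        if score >= threshold:
            scored.append((term, round(score, 3)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

dictionary = ["diptera", "dipteran", "syrphidae", "muscidae"]
trigram_index = build_trigram_index(dictionary)
print(similar_terms(trigram_index, "dipterra"))
```

Raising the threshold shortens the candidate list and speeds up the later reading of inverted lists, at the price of possibly missing relevant terms, which is the compromise described above.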
In contrast to ordinary balanced trees, in a trie all strings that share a common prefix are located in one subtree. Each edge is labeled with a certain string. The words of the list correspond to the terminal nodes ("leaves").
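A minimal trie can be sketched in Python to show why a common prefix confines the search to one subtree. For brevity the sketch labels each edge with a single character and marks word ends with a flag instead of a dedicated leaf; both are simplifications chosen for this illustration, and the class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # edge label (one character) -> child node
        self.is_word = False  # True if a dictionary word ends at this node

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        """Return all stored words that start with the given prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:  # walk only the subtree below the prefix
            current, text = stack.pop()
            if current.is_word:
                found.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(found)

trie = Trie(["musca", "muscidae", "muscina", "syrphus"])
print(trie.with_prefix("musc"))
```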
The figure below shows the window of the program that builds the inverted list of XML documents.
Information retrieval using Boyer-Moore (BM) and regular expressions (URE)
Since XPath has no means of modifying a document, the program implements two variants of searching for and replacing occurrences of expressions, both in open documents and recursively in directories by document type. The search dialog box is shown in the figure below. The replace dialog box looks much the same.
The search results can simply be displayed in the errors, hints and search results window or in the open documents, or external programs or functions (from dynamically loaded libraries) can be called. In addition, the search results can be used as an additional filter in the inverted-list builder (that is, the resulting list of documents is then passed to the indexer).
In the figure above, the window with the search results is shown below the green line.
Transmission of the information
Windows 2003 provides XML support for entering data through XML forms, and the Microsoft Office package is likewise adapted for storing XML documents.
The data obtained can be transformed using, for example, XSLT.
An important part of documents describing animal species is the bibliography. When the program was designed, the requirement was taken into account to follow one of the accepted formats for machine-to-machine transfer of bibliographic data (ISO 2709, USMARC, UNIMARC), which have been developed since the end of the 1950s.
In addition, bibliographic data were used to test the performance of the libraries written for search and index construction, since bibliographic records are available in much larger volumes than taxonomic ones.
The existing formats are small variations of the tape format developed by the US Library of Congress at the end of the 1950s. Briefly, a bibliographic data file consists of records following one another continuously, separated by record markers. Each record consists of a header, called the leader; a table of field offsets, called the directory; and the data proper, called fields and subfields.
For example, the author field has code 100 and includes the subfields 1 (type of personal name), a (personal name), b (dynastic number), c (title), d (dates), e (role of the person), q (fuller form of the name) and u (place of work). Some fields and subfields are repeatable; the field lengths, the codes used and the abbreviations are also described in the standards.
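The leader, directory and field layout can be made concrete with a small Python sketch of the ISO 2709 exchange structure: a 24-byte leader whose bytes 12-16 hold the base address of the data, 12-byte directory entries (3-byte tag, 4-byte length, 5-byte offset), and the delimiters 0x1F, 0x1E and 0x1D for subfield, field and record. The sketch handles only data fields with two indicators, collapses repeated subfields, and uses illustrative leader and indicator values; the function names are hypothetical.

```python
FT, RT, SD = b"\x1e", b"\x1d", b"\x1f"  # field, record and subfield delimiters

def build_record(fields):
    """Assemble a minimal record from (tag, indicators, [(code, value)])."""
    directory, data = b"", b""
    for tag, indicators, subfields in fields:
        body = indicators.encode() + b"".join(
            SD + code.encode() + value.encode() for code, value in subfields) + FT
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode()
        data += body
    base = 24 + len(directory) + 1   # leader + directory + directory terminator
    total = base + len(data) + 1     # + record terminator
    leader = f"{total:05d}nam  22{base:05d}   4500".encode()
    return leader + directory + FT + data + RT

def parse_record(raw):
    """Return [(tag, indicators, {code: value})] using leader and directory."""
    base = int(raw[12:17])               # base address of data
    directory = raw[24:base - 1]         # 12 bytes per entry
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode()
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start: base + start + length - 1]  # drop terminator
        indicators, *parts = body.split(SD)
        subfields = {p[:1].decode(): p[1:].decode() for p in parts}
        fields.append((tag, indicators.decode(), subfields))
    return fields

record = build_record([
    ("100", "1 ", [("a", "Ozkarahan, E.")]),
    ("245", "10", [("a", "Database Machines and Database Management")]),
])
for tag, indicators, subfields in parse_record(record):
    print(tag, repr(indicators), subfields)
```

Because every field is located through the directory, a reader can jump straight to field 100 without scanning the preceding data, which is part of why the format stores complex descriptions compactly.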
This record structure allows complex descriptions to be stored quite compactly. Converting the records into another, internal format might allow faster search, but only at the cost of simplifying the record structure and, correspondingly, losing part of the information.
It is therefore reasonable to store the data as is, in the standard format, rather than to develop a custom method and formats for storing bibliographic data.
While the choice of the internal representation of bibliographic records was effectively dictated by the requirement to conform to accepted standards, the solution for fast retrieval of the required records was less trivial.
Obviously, retrieving records by attributes (by the values of fields and subfields) requires building indices. However, when solving the problem in general form, one must bear in mind that the number of fields and subfields defined by the standards is enormous; their enumeration alone fills an entire volume.
It is known from practice that, for example, in the high-performance DBMS Btrieve (later Pervasive) running under the Novell NetWare operating system, rebuilding the index for one field in a catalog of 100 thousand records takes more than one working day. With SQL servers the time required to build the index is even greater.
Comparison of response times
As an experiment, an application was written that builds an inverted-list index for about 10 fields in catalogs of 3 thousand and 100 thousand records, working through the BDE driver with Paradox and with the SQL servers InterBase 5 and Oracle 8.
On a Pentium II, creating the inverted list for the small catalog took from several minutes to tens of minutes; for the standard catalog of 100 thousand records the procedure did not finish within an acceptable time.
Searching the resulting indices gave a response time of about one second with Paradox, and from one minute to several tens of minutes (depending on the complexity of the query) with Oracle 8. With InterBase 5 no acceptable response times could be obtained when searching on several fields, because the SQL query optimizer produced incorrect plans in which the query degenerated into a sequential scan of records (full table scan); constructing plans by hand was pointless, since the response time was unsatisfactory even when the indices were used.
The reason is fairly clear. A search on one field requires the DBMS only to read, whereas with several conditions, according to E. Ozkarahan's book, the intersection of multiple data sets (the 'AND' operator in relational algebra) cannot be performed by read operations alone: additional temporary tables are created.
Write operations, meanwhile, are expensive, and in a multi-user relational DBMS these costs are even higher because of transaction support (needed to roll a transaction back in case of errors).
The tests also required additional tuning of the Oracle RDBMS server, since these procedures frequently caused the transaction log to overflow. By default Oracle does not allow more than 5-10 thousand records, even small ones, to be inserted within a single transaction; this is a well-known peculiarity of servers such as DB2 and Oracle. A second peculiarity is record locking, followed by table locking and failure of the server. Consequently, a search on frequently occurring attributes of publications most likely ends in a failed query, since the database server cannot create large temporary data sets.
InterBase, unlike other industrial relational DBMSs, uses a multi-generational record architecture patented by DEC. It was developed in line with technologies for military applications, in particular a system for information support and automatic target selection and guidance in a combat group. Its advantage lies not only in the absence of a transaction log but also in the fact that each transaction executes in a virtual database, so that locking is avoided. However, the performance of InterBase is insufficient in our case. To get around this with industrial solutions, one could try a network DBMS built on MUMPS technology, or Sleepycat; however, either their operating costs are too high, or the achievable response time, although better, would improve only insignificantly.
Since the functionality of the application is limited to reading, without modification of data, it was decided to reduce the response time without industrial data management tools, by writing custom retrieval procedures based on an algorithm for building an inverted list of words and on the quicksort algorithm.
It was decided not to build a B-tree over the word dictionary, since the dictionary was assumed to be relatively small, so that B-trees would bring no benefit. This holds under the assumption that most records are written in one, two or three languages.
The inverted list is built in three passes.
First the records are scanned and the list of words is built; noise words are skipped. For simplicity, the following are treated as noise: words that occur more than a given number of times, words shorter than a given length (these are assumed to be conjunctions), numbers (digit sequences shorter than a given length), and words longer than a given length. Building lists of noise words is fairly complicated, and it was decided not to implement more complex algorithms.
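The first pass with its simple noise rules can be sketched in Python as follows. The threshold values, the decision to count a word once per record, and the function and parameter names are all assumptions made for this illustration; the report does not state the values actually used.

```python
import re
from collections import Counter

def build_dictionary(records, max_count=2, min_len=3, max_len=20, min_digits=4):
    """First pass: collect words from the records and drop noise words."""
    counts = Counter()
    for text in records:
        # Assumption: a word is counted once per record.
        counts.update(set(re.findall(r"\w+", text.lower())))

    def is_noise(word, count):
        if count > max_count:        # occurs too often
            return True
        if len(word) > max_len:      # too long
            return True
        if word.isdigit():           # short digit sequence
            return len(word) < min_digits
        return len(word) < min_len   # short word, probably a conjunction

    return sorted(w for w, c in counts.items() if not is_noise(w, c))

records = [
    "Flies of Yakutia 2003",
    "Flies of the taiga 12",
    "Flies and the tundra",
]
print(build_dictionary(records))
```

In this sample "flies" is dropped as too frequent, "of" as too short and "12" as a short number, while "2003" survives because it reaches the minimum digit length.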
The structure of a record in the word dictionary is as follows:
As can be seen, the size of a record can vary because of the length-prefixed string (a so-called Mac-string or ShortString).
Then the list of pointers to fields and subfields (or to elements and their attributes) is built. This list links the words in the dictionary with the table of pointers to records:
Then the offset values in the dictionary table that have not yet been filled are filled in, and the table of references to records is formed.
The three tables described form a pyramid:
the combination of the word/fields
Word 1.. Word N
field..field N ... field 1..field L
Record 1..Record R ... Record 1..Record D
The simple data structure permits a fast search. The search is also performed in several stages.
First the longest word is looked up in the dictionary table. It is assumed that short words are used more frequently in a language and long words less frequently; if this holds, an additional gain is obtained. Using the start and end offsets into the table of fields, the references for the field (subfield) in which the word is sought are located. This yields the list of pointers to records.
This procedure is then repeated for the other words, except that the references which do not satisfy the search condition for the subsequent words are removed from the list of record pointers obtained for the first word.
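The staged search can be sketched in Python. To keep the example short, the three tables are collapsed into one mapping from a (word, field) pair to a sorted list of record numbers; that simplification, the sample data and the function name are assumptions made for this illustration.

```python
def search_records(index, field, words):
    """Start from the longest word, then filter by the remaining words."""
    if not words:
        return []
    ordered = sorted(words, key=len, reverse=True)
    # Longest word first: it is assumed to be the rarest, so the
    # initial candidate list is as short as possible.
    candidates = list(index.get((ordered[0], field), ()))
    for word in ordered[1:]:
        if not candidates:
            break
        allowed = set(index.get((word, field), ()))
        candidates = [rec for rec in candidates if rec in allowed]
    return candidates

index = {
    ("domestica", "title"): [3, 7, 9],
    ("musca", "title"): [1, 3, 7],
    ("of", "title"): [1, 2, 3, 4, 5, 6, 7, 8, 9],
}
print(search_records(index, "title", ["of", "musca", "domestica"]))  # [3, 7]
```

Only the list for the first word is ever held in full; every later step can only shrink it, so the working set stays bounded by the shortest starting list.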
This makes clear why the words are ranked by length at the first stage.
Since the number of records in the worst case does not exceed the limit set for noise words, a list of references of 60 thousand records, about 200 KB, can be held entirely in the memory of the process without page faults.
After the search is completed, the resulting set of records is additionally truncated to a set limit (currently 4096 records), since the user will not look through all of them anyway but will most likely refine the search parameters.
A multi-threaded ISAPI module for the IIS web server has been implemented. Bibliographic records are stored in the USMARC format, and XML documents are stored in the file system. The inverted list is stored in three files. The inverted list does not have to be loaded into memory, which makes it easy to connect different databases for search.
The search is fast: the response time (depending on the speed of the Internet connection) does not exceed one second. Readiness for peak loads is rated as high: at present only one of the 32 available threads is in use, memory consumption is minimal, and the load on the processor is imperceptible.
For example, with 80 thousand records the files of the inverted list are: the dictionary, 3 MB; the field pointers, 2 MB; the record pointers, 8 MB (with 10 indices).
The search page templates, which contain pseudo-tags of the following form, are loaded into memory:
<!-- record ndx=sakha -->
<!-- record -->
Applications have been created for building the index and importing data. Building the inverted list takes 5-7 minutes on average.
Growth in volume
Of course, the inverted index can be quite large. To reduce the file size, an algorithm for detecting and removing noise words was used; the dictionary then retains the entries actually needed for the overwhelming majority of queries.
The second method is to store relative addresses: for each position, what is stored is not its absolute address but the difference between the addresses of the current and the preceding positions. For greater efficiency the file is packed (Golomb codes and other not very aggressive packing algorithms); however, highly effective compression algorithms are rarely used, since the gain from compression is eaten up by the processor time spent unpacking the data.
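Storing differences and packing them with a Golomb-type code can be sketched in Python. The sketch uses the Rice code, the special case of the Golomb code whose divisor is a power of two (2**k with k >= 1), and represents the bit stream as a string of "0" and "1" for readability; both choices and all names are assumptions made for this illustration, not the packing used in the program.

```python
def to_gaps(positions):
    """Replace sorted absolute positions by differences from the previous one."""
    previous, gaps = 0, []
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps

def from_gaps(gaps):
    """Restore absolute positions from the stored differences."""
    total, positions = 0, []
    for gap in gaps:
        total += gap
        positions.append(total)
    return positions

def rice_encode(values, k):
    """Encode non-negative integers: unary quotient, then k-bit remainder."""
    bits = []
    for value in values:
        quotient, remainder = value >> k, value & ((1 << k) - 1)
        bits.append("1" * quotient + "0" + format(remainder, f"0{k}b"))
    return "".join(bits)

def rice_decode(bits, k):
    """Inverse of rice_encode for the same k."""
    values, i = [], 0
    while i < len(bits):
        quotient = 0
        while bits[i] == "1":
            quotient += 1
            i += 1
        i += 1  # skip the "0" that ends the unary part
        values.append((quotient << k) | int(bits[i:i + k], 2))
        i += k
    return values

positions = [3, 7, 8, 20, 21]
gaps = to_gaps(positions)            # [3, 4, 1, 12, 1]
packed = rice_encode(gaps, k=2)
print(gaps, packed, len(packed), "bits")
print(from_gaps(rice_decode(packed, k=2)))
```

Small gaps cost only a few bits each, and decoding needs nothing beyond shifts and comparisons, which fits the remark above that heavier compression would cost more processor time than it saves.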
Information retrieval on the Internet
An example of retrieving species records is shown in the figure below and is available at: http://ensen.sitc.ru/taxon/
When entering words, the " * " sign can be used to stand for any characters.
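One way to support such a wildcard is to expand the pattern against the dictionary of the inverted list before the lists themselves are read. The following Python sketch assumes that " * " stands for zero or more arbitrary characters and that matching ignores case; the function names and the sample dictionary are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a pattern in which '*' means any run of characters."""
    parts = [re.escape(part) for part in pattern.lower().split("*")]
    return re.compile(".*".join(parts))

def expand(dictionary, pattern):
    """Return the dictionary terms matched by the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [term for term in dictionary if regex.fullmatch(term)]

dictionary = ["musca", "muscidae", "muscina", "syrphus"]
print(expand(dictionary, "musc*"))   # ['musca', 'muscidae', 'muscina']
print(expand(dictionary, "musc*a"))  # ['musca', 'muscina']
```

The matched terms can then be searched as alternatives, so the wildcard costs one sequential pass over the dictionary and no scan of the documents.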
The search is performed by the xmlndx ISAPI/NSAPI dynamic library loaded into the address space of the web server. The module uses the inverted-list files created by the program.
To render the result as an HTML document, the xmlndx module looks for an XSLT file (an XML document that describes the XSLT transformation of the document into XHTML).
The XML document can also be viewed as is:
For special cases an offline version has been written; its main window is shown in the figure below. This program implements an additional COM server that provides a special protocol, ap://, through which the functions of the ISAPI module are called without the need for a web server, by emulating it.
A similar program was also written to serve queries for dynamic content written to a compact disc and handed over to third parties.
This makes it easy to transfer dynamically generated content from the web server on data media to other locations, without loss of functionality and without configuring a web server.
Bibliographic search, in online and offline versions, is available at: http://ensen.sitc.ru/ldbndx/etc/
Download apoo editor from :
This document is available at:
Alex Chapman. Directions for the structure of taxonomic descriptive data. Taxonomic Databases Working Group, 17th Annual Meeting, 9-11 November 2001. http://www.tdwg.org/2001meet/AlexChapman.htm http://www.tdwg.org/2001meet/mins2001.htm
Smirnov I.S., Lobanov A.L., Alimov A.F., Dianov M.B., Medvedev S.G. Development of information retrieval systems for zoology // ADBIS'96. Proceedings of the Third International Workshop on Advances in Databases and Information Systems. Moscow, September 10-13, 1996. Extended Abstracts. 1996. Vol. 2. P. 60-63.
Ron Gilmour. Gymnosperms of the Southeastern US - A Premature Sample of the Use of XML in Systematic Botany. Bioinformatics 2000, 16(4): 406-407.
SysTax - a Database System for Systematics and Taxonomy. http://www.biologie.uni-ulm.de/systax/documentation/interfaces/import_dtd.html
Anton Güntsch. DTD for biological collection information. European Natural History Specimen Information Network (ENHSIN), Botanic Garden and Botanical Museum Berlin-Dahlem. http://www.bgbm.org/BioDivInf/Projects/ENHSIN/PilotCollectionDTD.htm
Deriving an XML based format for Taxonomic Information. DELTA (DEscription Language for TAxonomy), International Taxonomic Databases Working Group. http://biodiversity.uno.edu/delta/www/standard.htm
Aaron Skonnard. Turn User Input into XML with Custom Forms Using Office InfoPath 2003.
E. Ozkarahan. Database Machines and Database Management. Prentice-Hall, New Jersey, 1986.