    A global introduction to Fichoz

    Fichoz was originally developed by a group of French, Spanish and Chilean historians as part of a research project on the political and administrative structures of the Spanish Monarchy (the PAPE group). It was later extended to other historical fields as part of the research programs of the LARHRA's1 Pole Méthode (CNRS / University of Lyon, France). Various independent databases are presently being developed using the same methodology.

    Insert I. Some databases developed using the Fichoz methodology:

    Fichoz-Ozanam. PAPE group, for a database of persons interacting with the Spanish monarchy, XVIIth-XIXth centuries, around 0.5M records; ANR GLOBIBER, on XIXth-century Spanish colonial administration and colonial and home political disorders.

    Fichoz-Navigocorpus. ANR2 Navigocorpus, for a database of world shipping movements, XVIIIth-XIXth centuries. Around 0.1M records, growing fast.

    Fichoz-Montpellier. Independent database of 1,200 XVIth-century wills from Montpellier.

    Fichoz-Tapestry. ANR Arachné, for the history and analysis of tapestries from the Middle Ages to the XXth century.

    Fichoz-Pompei. User: independent research project, for a study of Pompeii's electoral painted scripta (Ist century).

    Fichoz-Tunisie. CMCU Utique3 research project, for a study of Tunisia's political structures under the French protectorate (XIXth-XXth centuries).

    Fichoz-Charleville. User: ANR "Charleville Mobilités, Populations, Familles", for a database of population lists and BMD books (XVIIth-XXth centuries).

    Fichoz_Press. Independent research project by an Austrian researcher on press correspondents in the Spanish Civil War.

    Fichoz_militares. Independent research project, a prosopography of Roman officers in the late Republican and early imperial periods.

    The present text is an introduction to Fichoz. In the first part, we expound the general principles and main concepts on which we based our work. We are perfectly conscious that Fichoz does not respect the canons of present-day mainstream orthodoxy in database structuring. We nevertheless firmly believe that it prefigures the canons of tomorrow's orthodoxy, as some recent developments in database theory show (see Appendix I). This part is in fact a short treatise on what a scientific database for social science is, or might generally be. Such considerations primarily aim at setting Fichoz in its proper context. Nevertheless, we consider Fichoz a good starting point for a general overview, because of its one main and indubitable quality: it works. It has been able to cope not only with the research program it was first made for, but also, as Insert I shows, with the demands of other researchers in the most varied historical fields. All that by simply extending the principles on which it was originally based. We must conclude that they are fundamentally sound.

    In the second part, we describe the table structure of Fichoz. This second part is, obviously, more purpose-oriented.

    We add two appendices. The first one is an extract from Pampulim Caldeira (Carlos), A Arte das Bases de Dados, com exemplos de aplicação para Oracle e SQL Server, Sílabo, Lisboa, 2011, p. 177-1884. Carlos Caldeira shows how the traditionally rigid system of SQL relational databases must give way to far more flexible approaches, exactly what Fichoz intended. The second one describes the necessary steps to access Fichoz on-line.

    1 LARHRA: Laboratoire de Recherche Historique Rhône-Alpes (CNRS UMR 5190), 14 av. Berthelot, F69007 Lyon. Contact: [email protected]

    2 ANR: French acronym. Agence Nationale de la Recherche. French State-funded national research programs.

    3 CMCU Utique: French acronym. Bilateral scientific cooperation program between France and Tunisia.

    4 We publish a French translation of the original Portuguese text, made for the benefit of our local students. We shall probably provide an English translation in the near future.


    I. Aims and philosophy

    Fichoz is a database system for the global storage of social history data. "Social" must be understood in its broadest meaning: all kinds of data referring to or bearing on interactions between persons, between persons and corporations, between corporations, between persons and the artefacts they produce, between persons and their natural surroundings, between artefacts and artefacts (see below, considerations about the concept of actor)5.

    Insert II. Some examples of interactions processed by Fichoz:

    Persons with persons: patronage relationships, friendship, community of birthplace, genealogical relationships (without limits), etc.
    Ex. P1, as war secretary, favours P2, a businessman, in obtaining provisioning contracts for the army.

    Persons with corporations: membership, leadership, cooperation, etc.
    Ex. P1, a graduate of Harvard Business School, is head of the Northrop aircraft company.

    Corporations with corporations: merging, splitting, etc.
    Ex. Citroën car factories merge with Peugeot car factories under the name of P.S.A.

    Artefacts with artefacts: ship with cargo items, cartoons with tapestries, etc.
    Ex. The ship "Mermaid" carries three tons of salted cod from Benghazi to Beirut, and later sails on, empty, to Alexandria.

    Persons with artefacts: owner with ship; resident with house; author, publisher, censor, etc. with one another and with a book, etc.
    Ex. James Parson's book on Cognitive frictions in God is condemned by William Boulton, a canon of Canterbury, acting in his capacity as theological adviser to the archbishop of Canterbury.

    a) What is a database?

    A database is a tool. It has no value in itself; it is worth the interactions it makes possible between a user and the data he is handling. No less, no more. For clarity's sake, we'll split such interactions under three heads: storing, retrieving, enhancing.

    1) Storing, retrieving, enhancing; processing excluded

    A database is, first and foremost, a tank, a reservoir of data. Nevertheless, there is no agreement among researchers on the fact that it should only be such a tank, and nothing more. Many dream of a tool which would allow them not only to store and retrieve data, but also to process them through built-in processing tools. We shall stress later that processing is, in some way, inherent to storing and retrieving. In a sense, data stored into the database have no previous existence. They are created in the act of storing them, by transforming raw sources, by structuring in a new way the information raw sources carry, so as to bring to the front arrangements and connexions which make an easier management of the same possible; arrangements and connexions which existed before, but which the previous disposition of the source kept in a discreet background.

    5 Latour (Bruno), Reassembling the social. An introduction to actor-network-theory, Oxford, Oxford University Press, 2005, X + 301 p.


    Insert III. Transforming a source into data, an example:

    Source: Now the serpent was more subtil than any beast of the field which the Lord God had made. And he said unto the woman: "Yea, hath God said, Ye shall not eat of every tree of the garden?" And the woman said unto the serpent: "We may eat of the fruit of the trees of the garden. But of the fruit of the tree which is in the midst of the garden, God hath said, Ye shall not eat of it, neither shall ye touch it, lest ye die". And the serpent said unto the woman: "Ye shall not surely die. For God doth know that in the day ye eat thereof, then your eyes shall be opened, and ye shall be as gods, knowing good and evil". And when the woman saw that the tree was good for food, and that it was pleasant to the eyes, and a tree to be desired to make one wise, she took of the fruit thereof, and did eat, and gave also unto her husband with her; and he did eat. And the eyes of them both were opened, and they knew that they were naked; and they sewed fig leaves together, and made themselves aprons. And they heard the voice of the Lord God walking in the garden in the cool of the day: and Adam and his wife hid themselves from the presence of the Lord God amongst the trees of the garden. And the Lord God called unto Adam, and said unto him, "Where art thou?"...

    Data loaded to the database:
    [God] gives a law to [Adam] - "We may eat... lest we die"
    [God] gives a law to [Eve] - "We may eat... lest we die"
    [Serpent] tempts [Eve] - "Now the serpent... good and evil"
    [Eve] eats the forbidden fruit - "And when the woman... and did eat"
    [Eve] convinces [Adam] - "And gave..."
    [Adam] eats the forbidden fruit - "And he did eat"
    [God] tries [Adam] - "And they heard the voice... thou"
    [God] tries [Eve] - "And they heard the voice... thou"

    Comments: we selected the actors involved (names written between brackets) and the actions they carry on as the basis on which to articulate the data. This choice is consistent with the needs of historical analysis. Not so much with those of stylistic analysis: we break the continuity of the text and lose the possibility of detecting the stylistic devices which create the dramatic quality of the story. Transforming means choosing, and in some way or other renouncing some dimensions of the source.

    Rearranging the source into data can be done in a variety of ways. Nevertheless, researchers in the same field tend to agree, at any given moment, on the best way to do it in order to maximize results. Such coordination is made possible by the fact that they partake in the same dominant scientific paradigm.

    Insert IV. Recent historiography, a background for using actors as a pattern for transforming a source into data:

    Individuals were for long considered the almost unique prime movers of history, and history itself a tool to determine the relative dignity of great men through a measure of the incidence they had on their surroundings. As a reaction, European continental historiography - and especially French historiography - built in the second half of the XXth century a model from which the individual was practically banned, in favour of the dialectical confrontation of big social masses, made up of men, but also of ideas, of cognitive tools, of ecological changes, of economic trends, etc. In a third and present stage, history interprets social dynamics as the result of an almost infinite number of micro-interactions between actors and objects, an option nowadays common to all social sciences. The position of objects within such interactive networks and their efficiency at creating relations between persons make them quasi-actors. In Insert III, we could reasonably have raised the forbidden fruit to the status of an actor, and made it an intermediary between Adam and Eve6.

    Processing the resulting data to extract scientific conclusions means rearranging them once more, in a way which cannot be the object of any previous agreement. This impossibility derives from an essential characteristic of research. No researcher knows beforehand which process will be the most efficient at extracting new knowledge from the data. He proceeds through attempts and failures, guided by his own experience and the collective sapience accumulated by the corporation he belongs to. This accumulated experience is mostly negative. It instructs on what has failed: had it succeeded, no new discovery would be possible in that particular field, using that particular working hypothesis, and research would be pointless. Discovery almost always comes through the use of unexpected re-structuring tools. Some are absolutely new ones. Some are well-known implements newly imported from other fields into the domain under investigation. This implies that equipping a scientific database beforehand with a set of processing tools only makes it cumbersome, and even misleading: the sheer presence of seemingly efficient implements - the fact they were chosen to equip the database means that they achieved some previous success - puts a limit on the researcher's personal initiative and inventiveness. We conclude

    6 Sandstrom (Greg), "Actor-Network Theory", ISCID Encyclopedia of Science and Philosophy, bef. 2010, http://www.iscid.org/encyclopedia/Actor-Network_Theory, 8/2010; see also: "Théorie de l'acteur-réseau", Wikipedia (French), bef. 2010 (23/8/2010); and: Dedieu (Jean Pierre), "Une nouvelle approche de l'histoire sociale: les grandes bases de données", Sciences de l'homme et de la société, 2003, N° spécial "Vie de laboratoires", n° 66, p. 35-38; Latour, op. cit.


    that a research database must be equipped with all that is necessary to store, retrieve and enhance data; not, properly speaking, to process them. It must nevertheless be able to export them easily to any downstream processing package.

    Insert V. Exporting data to a processing package:

    Most data processing packages are nowadays able to read tabulated input data.

    Ex:
    Province [tab] Cities [tab] Boroughs [tab] Villages
    Sevilla [tab] 2 [tab] 15 [tab] 89
    Madrid [tab] 0 [tab] 25 [tab] 165

    would be read by Access, Excel, OpenOffice Calc, FileMaker and any other modern spreadsheet, mapping or statistical package as an array of four columns and three lines, the first of which would be considered a title line. Creating a tabulated sequential file from a database based on fields and records is in most cases quite easy.
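    By way of illustration only, the following minimal Python sketch writes such a tabulated sequential file from a handful of field-and-record data; the file name and field names simply reproduce the example above and are not part of Fichoz itself.

        import csv

        # Illustrative records, using the field names of the example above.
        records = [
            {"Province": "Sevilla", "Cities": 2, "Boroughs": 15, "Villages": 89},
            {"Province": "Madrid", "Cities": 0, "Boroughs": 25, "Villages": 165},
        ]

        # Write a tab-separated sequential file which a spreadsheet, mapping or
        # statistical package can read; the first line serves as the title line.
        with open("provinces.tsv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(
                f, fieldnames=["Province", "Cities", "Boroughs", "Villages"], delimiter="\t"
            )
            writer.writeheader()
            writer.writerows(records)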

    Some other downstream processing packages demand a more complex formalization, such as Pajek®, a well-known network analysis tool:

    Ex:
    *Vertices 10
    1 000003-ABARCA_BOL ic Yellow
    2 000228-SOMODEVILL ic Blue
    3 000339-GOYENECHE_ ic Yellow
    4 000365-HERRANZ_TO ic Blue
    5 000550-LOPEZ_PACH ic Blue
    6 000559-MUZQUIZ_GO ic Yellow
    7 000630-VILLAMARQU ic Blue
    8 000685-GUEMES_HOR ic Yellow
    9 000737-SALABERT_A ic Yellow
    10 000864-VASIANA_J ic Blue
    *Arcs
    1 2 1 c Blue
    3 2 1 c Blue
    1 5 1 c Violet
    8 9 1 c Blue
    6 1 1 c Blue
    6 2 1 c Blue
    6 3 1 c Blue
    1 4 1 c Blue
    1 7 1 c Violet
    3 7 1 c Blue
    7 9 1 c Blue
    7 2 1 c Blue
    3 4 1 c Blue
    3 5 -1 c Red

    In such a case, it is necessary to create an intermediary file into which the data are imported as sequential text in tabulated form, and later transformed according to the requirements of the package.
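    As an illustration only, a minimal Python sketch of such an intermediary transformation, turning a tabulated list of actors and ties into a Pajek file like the one above; the variable names and the sample values are merely assumptions.

        # Tabulated input, as it might come out of the database export:
        # one entry per actor, one entry per tie (source, target, weight, colour).
        vertices = [("000003-ABARCA_BOL", "Yellow"), ("000228-SOMODEVILL", "Blue")]
        arcs = [(1, 2, 1, "Blue"), (2, 1, -1, "Red")]

        lines = [f"*Vertices {len(vertices)}"]
        for number, (label, colour) in enumerate(vertices, start=1):
            lines.append(f"{number} {label} ic {colour}")   # "ic" sets the vertex colour
        lines.append("*Arcs")
        for source, target, weight, colour in arcs:
            lines.append(f"{source} {target} {weight} c {colour}")

        with open("network.net", "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")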

    2) Storing data

    This conclusion raises a new, although related, question: should this storing tool be purpose-oriented and temporary, or rather permanent? If temporary and purpose-oriented, the storage would have to preserve validity only as long as the research program for which the data were gathered; if permanent, not only should the storing hardware and software physically last far longer - a span of thirty years is usually considered a standard7, although it looks too short for research purposes - but data should also be structured in a way which would make possible their use for a variety of purposes, that is by a variety of research programs, which might be quite different from the original one. If not, preservation would be pointless, except as a means of checking the conclusions drawn from the data by studies based on the same, an important goal, but a rather limited one which, given the pace at which

    7 Such is the duration recommended by the French institute in charge of long term preservation of scientific databases, Adonis. About Adonis see: http://www.tge-adonis.fr (12/09/2011).


    scientific literature grows obsolete, would hardly justify the costs. Anyway, the difficulty in structuring data within a scientific database primarily arises from the fact that the use which any research program will make of its own data is highly unpredictable, even within the scope of the planned original program. The building of the database is of course oriented by working hypotheses as to research paths later to be developed; but a capacity for invalidating working hypotheses is precisely the mark of innovative research. We can then safely conclude that a research-oriented database must be built as a permanent, non-purpose-oriented tool, even when it is not intended to be used by any other research program than the one it has been planned for. This point has far-reaching implications. It means that data in the database must be structured along more basic and lower-level criteria than those on which any presently conceivable research program could possibly be based. We shall see further, when discussing the respective merits of data splitting by means of fields or of beacons, that a generic and unique model of database is impossible to conceive, and that such an impossibility probably derives from essential characteristics of historical data and not from the present state of computer technologies. This remark does not invalidate the concept of multi-purpose databases. It only restricts the scope of the multi-functionality assigned to each database to a broad field of globally similar data. We expect to conduct further research on this point in the near future.

    3) Retrieving stored data

    Retrieving is obviously all-important: storing without retrieving would be absurd. What matters here is flexibility. Because of the uncertainties inherent to any research (see above), a researcher must be able: 1) to access the data he is handling from whatever angle he thinks fit, which means that, starting from any part of the database, he must be able to reach any information stored in any other part; 2) to access them instantly, or at least as easily and as fast as possible. Using a more sophisticated vocabulary, with as little friction as possible.

    Insert VI. About friction:

    Friction is an increasingly important transversal concept in social sciences, not to mention its role in engineering. It covers all factors which impair the smooth working of a mechanical or intellectual system. In modern geography the degree of friction, that is the relative ease with which actors, cultural objects and goods are able to move within a given space, is considered a highly relevant factor for characterizing a particular space. In cognitive sciences friction characterizes an information set and denotes all factors that prevent a user from retrieving from it the relevant information he needs. As such, friction is related to the concept of limited rationality, of which it is a fundamental component8.

    Friction cannot be eliminated9. It must nevertheless be spread as evenly as possible over as many conceivable descriptive dimensions of the data as possible. The most important point is to limit unevenness. In fact, if the database gives easier access to one dimension than to any other, this favoured dimension will necessarily be given preference over other explanatory factors.

    Insert VII. How friction creates bias:

    A personal experience. Let the reader trust me: as far as I know the fact has not been published, and I don't wish to expose reputations by giving too detailed an account. A database about provincial governors. Vertical genealogies (parents, grandparents, great-grandparents) had been embedded into the database, and such relatives could easily be accessed through simple queries, starting from any other mentioned actor. Horizontal relatives (uncles, brothers-in-law, cousins...) had not. In fact, such relationships are far more difficult to process than vertical ones. The relevant data about horizontal genealogical relationships had been stored in remarks fields. They could be accessed, but only through specific queries, and the retrieved data could not be mixed with others. So that accessing them was subject to a high degree of friction. The conclusions of the study were exactly what this disposition implied: in explaining careers, it stressed that vertical family relationships were all-important, horizontal relationships of little consequence. All other known studies conducted on similar problems decisively conclude the other way round. In the present case, the way in which the database had been built had introduced so strong a bias as to falsify the results.

    Researchers are, of course, trained to counter such biases, and they need this training to overcome unavoidable disparities. But a well-planned database should minimize them. We'll split the problem of flexibility stricto sensu under two heads.

    - First of all, the software used to build the database should make possible a huge variety of complex queries without demanding from the user high-level programming capabilities. A research database is necessarily complex, and complexity prevents simplicity. Formulating a complex query will never be a simple thing, from a purely conceptual point of view. We nevertheless all know that some database packages make it easier, and others make it more complicated.

    8 Kasper (Wolfgang), Streit (Manfred E.), Institutional Economics. Social order and public policy, Cheltenham, The Locke Institute, 1999, XVII + 517 p.

    9 Any cognitive operation implies a level of friction. Only God may be defined as a space without friction.


    Insert VIII. Queries:

    In some databases you just enter a "query mode", usually by pressing Ctrl-F, and type into the relevant field the character string you want to retrieve. Even so, some queries may be complex.

    Ex. From a database with three classes of actors, two of which are characterised by the presence of a special letter at the end of the identifier, the third being characterized by the lack of such a letter (0000056C, 0003425L, 00765391), find all actors of the third class. You must write three queries: the first one will select all those records in which an identifier exists ("*" extant in the identifier field); the second one will eliminate from the selected set all those records which terminate in "C" ("*C" in the identifier field, and "Omit" parameter on); the third one will do the same for those records which terminate in "L".

    In other databases you have to write an SQL query every time you want to retrieve data. Such queries demand some limited but real programming ability (use of brackets, understanding of what a variable is, etc.). So you generally choose to stick to the pre-set queries which the administrator made ready for you. And by doing so, you most probably lose your point.
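    For comparison only, the three-step selection of the example above collapses into a single filter once the data are accessible from a general-purpose language; this minimal Python sketch uses invented identifiers and is not tied to any particular database package.

        # Illustrative identifiers: two classes end in "C" or "L", the third does not.
        identifiers = ["0000056C", "0003425L", "00765391", "0000999"]

        # The three successive queries (identifier extant, omit "*C", omit "*L")
        # expressed as one filter.
        third_class = [i for i in identifiers
                       if i and not i.endswith("C") and not i.endswith("L")]
        print(third_class)  # ['00765391', '0000999']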

    Ease in formulating the query is a fundamental point when complex and unstable documents are at play (see below xxx). It is not the only issue to be considered. Some packages are able to retrieve almost instantly character strings or values from any position within a field, almost independently of the bulk of the data processed; others demand the queried string to be stored at the beginning of the field in order to process huge volumes of data with swiftness. If such is the case, a way out consists in splitting the data into as many tables as possible, that is implementing a high degree of fine-grained granularity.

    The problem lies in the fact that granularity is not merely a formal character. It carries information. The fact that two data segments belong to the same field means that they are related by a strong univocal relationship. Preserving such a physical link is a highly efficient way of managing this connexion. If both segments have been stored into different fields, you must create a third entity to store information about the link which unites them. With two major drawbacks: complexity and insecurity. The first point is obvious enough. It is the more embarrassing because, as we'll see further, containing complexity is one of the main challenges posed to social sciences-oriented database design. The second point needs some extra comment. Splitting data into numerous tables means that you must split the content of the source to a low granularity level and make the link between the resulting parts explicit again in the form of linked tables. The problem here is that in historical research, the nature and strength of the substantial relationship between various pieces of information mentioned within a same document is not always obvious. Defining substantial relationships is in many cases the very aim of the research operation. Premature splitting and re-linking of data at a small granularity scale, on the basis not of formal criteria (verbi gratia the mention of all linked data within the same document), but on the basis of their substantial meaning, is liable to cause serious errors and must be, as far as possible, avoided.

    Insert IX. A personal experience on the dangers of premature action:

    I grew aware, after many years of using legal documents for historical purposes, that the way in which they described persons (name, age, dwelling place, signing ability, etc.) was highly context-dependent; that the description of a same person underwent significant changes at short notice from one document to another; and that these changes were significant for catching the meaning of the document. I was quite happy then that my databases, by maintaining a coarse-grained granularity on that point, that is by copying this information "en bloc" to one field attached to the document, had preserved a link between this description and the document in which it had been formulated, and had not split the information into as many independent pieces linked to the relevant actor only, as common practice suggested. I had not done it through any stroke of genius; rather from laziness. It was, nevertheless, quite an instructive experience.

    - Insert IX underscores the fact that software capabilities, although important for enhancing performance, are only a part of the story. The arrangement of the stored data within the database is a far more basic point. In fact, splitting up and distributing data among tables, records and fields - whatever the way in which this distribution has been worked out, either by physically splitting character strings into separate fields or by marking them with segregating beacons - is, in my view, the most important point of the art of operating databases. I usually call this operation "atomization", to stress the fact that every piece of the split data must be self-sufficient, form a close-knit significant unit and at the same time refer to one easy-to-identify concept. Preserving the usefulness of the split data for a variety of research programs demands, moreover, that this partition be fairly low-level. Moreover, historical matters demand that the implicit information carried by the fact that various pieces of information are mentioned within the same documentary set be preserved, as quite an important clue for reconstructing social reality from partial sources.

    How to split data within the database? Using fields or beacons? In order to preserve documentary units, some database packages tried to implement the "beacons" technology in historical data processing. The most systematic approach on that line was Manfred Thaller's Kleio10, in the 80's of the last

    10 Thaller (Manfred), Kleio. A data base system for historical research. Version 1.1.1, b-test Version, Göttingen, Max-Planck-Institut für Geschichte, 1987, 127 p.


    century. The beacons technology has since then been highly developed as a tool for discourse analysis, fundamentally in literary and sociological matters. The use of beacons is made necessary when the structure of the source - be this "source" a text or a graphic object; and "structure", its formal structure or its conceptual organization - is the object of the research. Beacons are also a practical and flexible tool to move separate parts of the source from one part of the database to another, and as such play an auxiliary role in relation to the fields technology. Beacons are good at formal analysis, be it rhetorical, grammatical or conceptual. Their strong point is that they preserve the source in its formal aspects and internal connexions; obviously a basic requisite in many research fields (see Insert III). But this strong point becomes a drawback when the document is not studied for itself, but as a source from which to extract sequences of actions carried on by actors. An aspect, precisely, on which social history is based.

    Insert X. Splitting a source by means of beacons:

    [Sq01]Now the [A1]serpent[A1] was more subtil than any beast of the field which the [A2]Lord God[A2] had made. And he said unto the [A3]woman[A3]: "Yea, hath [A2]God[A2] said, Ye shall not eat of every tree of the garden?" And the [A3]woman[A3] said unto the [A1]serpent[A1]: "[A3] [A4] We[A3] [A4] may eat of the fruit of the trees of the garden. But of the fruit of the tree which is in the midst of the garden, [A2]God[A2] hath said, Ye shall not eat of it, neither shall ye touch it, lest ye die". And the [A1]serpent[A1] said unto the woman: "Ye shall not surely die. For [A2]God[A2] doth know that in the day ye eat thereof, then your eyes shall be opened, and ye shall be as gods, knowing good and evil"[Sq01].

    [Sq02]And when the [A3]woman[A3] saw that the tree was good for food, and that it was pleasant to the eyes, and a tree to be desired to make one wise, [A3]she[A3] took of the fruit thereof, and did eat,[Sq02]

    [Sq03]and gave also unto her [A4]husband[A4] with her; and [A4]he[A4] did eat. And the eyes of [A3] [A4] them[A3] [A4] both were opened, and [A3 ] [A4] they [A3] [A4] knew that they were naked; and [A3] [A4]they[A3] [A4] sewed fig leaves together, and made themselves aprons.[Sq03]

    [Sq04]And [A3] [A4] they [A3] [A4] heard the voice of the [A2] Lord God[A2] walking in the garden in the cool of the day: and [A4]Adam[A4] and his [A3]wife[A3] hid themselves from the presence of the [A2]Lord God[A2] amongst the trees of the garden. And the [A2]Lord God[A2] called unto [A4]Adam[A4], and said unto [A4]him[A4], "Where art thou?"...[Sq04]

    Comments: We split the text into four sequences (more would have been possible) and marked every actor as A1, A2, etc. All these parts have been delimited by special labels, which we write here between brackets, and which the package would interpret as beacons. The computer would treat the segments between beacons as separate units, and use the content of the beacon as a label to name them.

    We consequently suggest, when ordinary social analysis is at play, splitting and rearranging source data into records and fields. In special cases, in which the structure of the document is unusually complex (e.g. trials), both methods can be combined to cope with this complexity11. We also suggest splitting the data into actions, so that each record corresponds to one action and each field to one descriptive dimension of the action.

    What is an action? An action is a move made by an actor, at a precise point of space and time, manifest to an external observer, a move which changes the actor's surroundings and/or his relationship to the same. An action can - and must - be described under four heads or dimensions: who, what, when, where. Recent developments in social sciences make a fifth one necessary: with whom12. Each record assigns a unique value to every one of these five dimensions. If this uniqueness is impossible, then the corresponding number of new records must be created to re-establish it. From now on, we'll conventionally call this fivefold set of dimensions a 5W pattern.

    The problem is that very few sources deliver events pre-formalized in such a way. The answer to the five questions we alluded to is usually scattered among various sentences or pages; sometimes it remains implicit, as the author of the document trusts readers (or viewers, if a graphic source is concerned) to fill in the blanks in line with the cultural practice of their community. When such is the case, and it almost always is the case, beacons do not work. Better said, they do not work in a practical way. It is far more expedient to construct artificial independent units, each one of which stands for an event, as in Insert III. Each one of these units should describe the event under the five above-mentioned dimensions, each one stored in a labelled container, that is a field. Each set of five fields then forms a record. This is the way Fichoz proceeds (see part II, further ahead).

    11 We are presently developing this point from a practical point of view with the help of Lorena Alvarez, a student from the Universidad de Cantabria, by now involved in a thesis on various trials of the XVIth century under Prof. Tomás Mantecón's supervision.

    12 Lemieux (Vincent), Les réseaux d'acteurs sociaux, Paris, PUF, 1999, VIII + 146 p.; Watts (Duncan J.), Six degrees: the science of a connected age, New York, Norton Books, 2003.
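    As an illustration only, a minimal Python sketch of a record built on the 5W pattern, filled with the Genesis example of Insert III; the field names and values are assumptions made for the example, not Fichoz's actual field list.

        from dataclasses import dataclass
        from typing import Optional

        @dataclass
        class Action:
            """One record = one action, described along the 5W pattern."""
            who: str                  # actor identifier
            what: str                 # the action itself
            when: str                 # date, to be made explicit whenever the source allows it
            where: str                # place
            with_whom: Optional[str]  # co-actor, if any
            source: str               # reference to the document the action is drawn from

        actions = [
            Action("Eve", "eats the forbidden fruit", "", "Eden", None, "Gen. 3"),
            Action("Eve", "convinces", "", "Eden", "Adam", "Gen. 3"),
            Action("Adam", "eats the forbidden fruit", "", "Eden", "Eve", "Gen. 3"),
        ]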


    Insert XI. Don't do it the obvious way:

    Formalising behaviours in a database must be done on the basis of the action, not on the basis of the actor. The actor is the most natural and obvious unit to be thought of as a starting point, just because it plays this same role in our daily life. Actions are naturally referred to actors, and we define social life as a play between actors, the actions being predicates of the actors, and not the other way round. This approach poses two serious sets of problems.

    The first one is mainly technical. If you create a record for each actor and a field within this record for each action, as many databases do, you'll have to define beforehand a closed list of actions in order to create the corresponding set of fields. That means that you'll take into account only a predefined set of actions. Which grossly infringes the principle of generality we formulated above and simply invalidates the process from a scientific point of view.

    The second drawback has a more epistemological character. If in daily life the actor comes first, things are not so clear for the social sciences. Daily life is based on empathy, a capacity to adopt for oneself the other's point of view, a capacity to see life with the other's eyes. The social scientist behaves in another way. He has no direct access to the subject and, in most cases, emphatically rejects empathy as based on an injection of personal views into the behaviour to be observed. Observed actions come first for him. The actor is only the sum of all the observed actions he carried on.

    Either of these reasons is enough to ground the database on the action, not on the actor. The actor must be considered an object described by the series of the actions he is implied in. The database will obviously comprise mechanisms which make it possible to build such series in a simple way. Identifiers and dictionaries are fundamental from this point of view.

    Proceeding on the basis of the action, we adjust historical data to current and robust technologies. The problem is that by doing so, we are losing the original organization of the information and introducing into it much extra information of our own. We shall now show that such tampering with the source is consubstantial with scientific practice and is not in itself invalidating, at least when it is done following the "règles de l'art".

    4) Transcoding and enhancing stored data

    In fact, atomizing information into records and fields does not invalidate data, but constitutes them as data. We must develop here an idea we already mentioned. Data are always information. But not all information is data. Data are information plus meta-information, data plus machine-readable markers which allow the computer to handle every piece of information as a black box, without taking into account its content while handling it. Fields and beacons are two different ways of enacting the same operation. Neither can be said to be intrinsically better. Both of them, anyway, tamper with the information delivered by the source.

    The source itself, in fact, does not deliver neutral information. We do not refer here to any neutrality of the content in relation to supposedly "hard" facts - quite an important question, but out of reach of the computer. We refer to formal neutrality, in the sense that any discourse means imposing on the information conveyed a set of extra markers to make this information communicable: by using a language, which is the most basic of these markers; but also through a set of rhetorical artefacts to code one's perceptions into messages. Any communication, from the receiver's point of view, means decoding the perceived message so as to extract the underlying information. Viewed in this way, atomizing data can be described not as coding, but as transcoding, changing one code into another code, in a threefold operation:

    - Understanding the source's message at a global cognitive level (understanding each word and the story they are relating);
    - Decoding the source's message to get hold of the information it is carrying;
    - Encoding the source's message into another coding system which makes it machine-readable; that is, reducing it to a set of character strings, each one of them labelled in a particular way to define the role of the enclosed information within the global frame of the message.

    Recasting the information either into fields or into character strings divided by beacons is a matter of convenience. It depends on which side of the information is to be given precedence in the research process. Both ways are tricky. In both cases the operation must be performed by high-level experts, with a perfect knowledge of the subject and a capacity to fully understand the meaning and implications of the sources involved and of the information conveyed. With a capacity, also, to stop short of introducing personal views into the process and to limit themselves to criteria fully accepted, beyond reasonable doubt, by the scientific community to which the data are to be addressed; criteria, moreover, not liable to be changed in the foreseeable future. These characteristics warrant their robustness, and that the transcoded data will remain useful in the long run.

    Being a basic operation, transcoding is also a fundamental one. Reversing decisions made at this stage is virtually impossible without loading the data afresh into the database.

    Transcoding is not the only necessary step. Information must be completed, especially when using fields, which implies a high degree of homogeneity as far as the content of a same field is concerned all over the database. Each bit of separated information, each field, must be made autonomous, so as


    to be used alone if necessary, in order to increase flexibility in data processing. That means for instance that a date must be explicitly provided where none is given: an undated historical document is simply useless; that mentioned actors must be identified, a name provided when none is mentioned, and a univocal identifier added to identify the actor among all those who use the same name. This also means that whatever information remains implicit in the source must be made explicit as data, and transcoded as an event even if it does not appear as such in the source. For instance, a mention in a will, for mere identification purposes, that such a woman (W) is the wife of such a man (M), means creating in the database a marriage event linking M to W. And so on. Such an enhancing process is very similar to the one which editors of historical documents implemented in the great editing enterprises of the XIXth century, such as the MGH13 or the CIL14. We drew our inspiration from the rules they elaborated then, which, with slight changes, adapted themselves perfectly to the requirements of electronic databases. Extra information they stored as notes or introductory comments, appendices and institutional dictionaries. We use the same techniques. The most obvious difference is that the content of notes is now usually embedded into the main data set. The problem consists in not introducing biases. We'll show further, with a variety of examples, how it can be done in a practical way.
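    By way of illustration only, a minimal Python sketch of the enhancement just described, turning the implicit mention of a wife in a will into an explicit marriage event; the field names, identifiers and dates are assumptions made for the example.

        # The source: a will of 1752 which mentions, for identification purposes
        # only, that W is the wife of the testator M.
        will_mention = {"document": "will of M, 1752", "actor": "M0001",
                        "text": "W, wife of the testator"}

        # The enhancement step makes the implicit information explicit as an event.
        marriage_event = {
            "who": "M0001",
            "what": "marries",
            "with_whom": "W0001",
            "when": "before 1752",              # the will only gives a terminus ante quem
            "where": "",                        # left empty: the source says nothing
            "source": will_mention["document"],
        }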

    Insert XII. Enhancing data: an example of how to identify an actor by means of an identifier under the various identities the primary source gives him:

    This example is taken from the Navigocorpus database, which records shipping movements. Data about a same ship are extracted from a variety of local port registers, which spell the ship's and her captain's names every time in a different way. The database preserves these spellings, but gives the ship and the captain a unique identifier.

    Captain's name      Identifier    Ship's name        Identifier    Homeport      Geo. identifier
    Arrisse, Nicolas    00001471      Aimable Pélagie    0001281N      Renéville     A0209567
    Nevince Nicolas     00001471      Aimable Pélagie    0001281N      Bernéville    A0209567
    Novissie Nicolas    00001471      Aimable Pélagie    0001281N      Reneville     A0209567

    Insert XIII. Enhancing data: making an entry self-sufficient:

    Text of the source: AHN Estado, lib. 34, f. 17 r, a list of magistrates of the Valencia (Spain) high court: "1725. Oidores de Valencia. Manuel Ramos, Pablo Molina, Francisco Andres, Pedro Martínez, Luis Herrera."

    Transcription into records of the database:
    1725
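    As an illustration only, a minimal Python sketch of how the list quoted above can be expanded into self-sufficient records, the date, the position and the archival reference, stated once in the source, being repeated in every record so that each entry can be used alone; the field names and the exact wording are assumptions.

        source_reference = "AHN Estado, lib. 34, f. 17 r"
        oidores = ["Manuel Ramos", "Pablo Molina", "Francisco Andres",
                   "Pedro Martínez", "Luis Herrera"]

        # One record per magistrate; date, function, place and source are repeated
        # in every record so that each entry stands on its own.
        records = [{"who": name,
                    "what": "oidor of the high court of Valencia",
                    "when": "1725",
                    "where": "Valencia",
                    "source": source_reference}
                   for name in oidores]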


    As a conclusive remark, we'll stress that transcoding is of special relevance for historical databases, the topic we presently have in mind. It is specific to historical studies - and partially too to literary studies - that, in most cases, their primary sources have been elaborated by persons who did not imagine any possible research operation based on them; by actors who were not trying to transmit any information about their social world to third parties. No father, when writing his will, keeps an eye on the use historians could make of it; and when an actor consciously leaves traces in view of posterity, by writing his Memoirs for instance, he usually is not at his best, and this kind of source must be counted among the least reliable ones. So the codes on which the document is based are those in force within a small community of actors, who do not try to make themselves understandable to a larger community. This lack of previous externally-oriented coding obviously demands a higher degree of transcoding, and makes it more difficult. Things are different when the matters to be transcoded, such as a sociological survey, have been elaborated in view of a specific research project; or when the source has been specifically planned to transmit information to third parties, such as statistical files. The codes in play then are those of a larger, and often a scientific, community. The degree and modalities of transcoding are consequently lighter. In some cases, the researcher does not even grow aware of the fact that, in spite of all, he is transcoding.

    5) Permanent coding

    We usually write events into the database as they figure in the source. We just expand their formulation in the enhancement process, to make explicit information which in the source remains implicit, embedded as it is in its context. Once formulated in such a way, data are not unambiguous, and queries on them must either be formulated in a very complex way or return inaccurate results.

    Insert XIV. Results of a query supposed to return a list of Councillors of Castille:

    The query was: "Consejero Castilla", in the middle field. All the records retrieved comprise the words "consejero" and "Castilla", but include data which exceed the limits of the queried object, such as deaths of Councillors of Castille, or mentions of relationships between actors in which Councillors were involved.

    To remedy such inconveniences, every record must be equipped with an unambiguous coding sequence. Queries made on the coding sequence will return unambiguous results. Insert XV displays the coding sequences of Insert XIV's content (second field on the right side).

    Insert XV. Permanent coding strings for Insert XIV:

    Permanent coding

    We use a fairly complex system, which, apart from disambiguation, provides clues to the institutional context and a classification of the data within the same. Let us analyse the coding for Councillor of Castille (FFEA-AKAAD-CAxx-xx). Each position has a meaning. The same letter in a different position has a different meaning; "x" marks empty positions.


    Insert XVI. Permanent coding. A Councillor of Castille

    F: Old regime
    F: Political royal institutions
    EA: Spanish monarchy
    -: visual separator
    A: council
    K (after FFEA-A): Council of Castille
    AA (after FFEA-AK): general offices of the Council of Castille
    D: councillors (A in that position and context would mean "president"; B, "governor"; C, "regent"; F, "King's prosecutor", etc.)
    -: visual separator
    CAxx: geographical location; here: court
    -: visual separator
    xx: empty positions which would, the case being, mark the way the agent holds his position (honorary, temporary, for life, hereditary, etc.)
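    As an illustration only, a minimal Python sketch of how such hierarchical codes can be queried by prefix or by position; apart from the Councillor of Castille code taken from Insert XVI, the codes and the function names are invented for the example.

        positions = {
            "P0001": "FFEA-AKAAD-CAxx-xx",   # councillor of Castille (Insert XVI)
            "P0002": "FFEA-AKAAA-CAxx-xx",   # invented: president of the Council of Castille
            "P0003": "FFEA-ABAAD-CAxx-xx",   # invented: councillor of another council
        }

        def by_prefix(codes, prefix):
            """All positions belonging to one branch of the organisational chart."""
            return [pid for pid, code in codes.items() if code.startswith(prefix)]

        def by_position(codes, index, letter):
            """All positions with a given letter at a given (1-based) position."""
            return [pid for pid, code in codes.items()
                    if len(code) >= index and code[index - 1] == letter]

        print(by_prefix(positions, "FFEA-AK"))   # everything in the Council of Castille
        print(by_position(positions, 10, "D"))   # everything hierarchically similar to a councillor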

    This kind of hierarchical coding makes it easy to retrieve all positions belonging to any Council of the Spanish Monarchy ("FFEA-A"), or to the Council of Castille ("FFEA-AK"), or to a position hierarchically similar to that of a councillor ("D" in 10th position), and many more queries besides.

    Such a coding is based on an objective reference, the organisational chart of the Spanish royal institutions; a reference independent of the user's mood or demands. For that reason it has been implanted within the database, as part of the same records which hold the data, in a permanent way. And for that reason we call it "permanent coding".

    6) On-the-way coding and dictionary tables

    "On-the-way" coding is quite different. It marks actors, actions, documents or places, that is any one of the elements which compose the database, according to criteria decided upon by the user as a function of the demands of the research he is currently carrying on. It provides easy access to complex sets of data, and provides markers to create classes which did not previously exist in the data.

    Insert XVII. Some examples of on-the-way coding:

    Military servicemen in Ozanam (Ozanam is a database on actors related to the Spanish Monarchy in the XVIIIth century). It is debatable whether "military servicemen" form a relevant class for interpreting the XVIIIth-century social world, as they undoubtedly do at the end of the XIXth century. Nevertheless, for a variety of purposes, it may be interesting to locate such actors and their actions easily, even in the XVIIIth century. Not being a universally accepted interpretative class, this category must not be inserted into the permanent coding string. In fact, the user will select sets of actions which, from his point of view as a researcher, denote that the relevant actor belongs to the "military servicemen" class. He will mark all relevant records with a marker he chooses, let us say for instance "Military". Another researcher, working with the same data, could mark other records as denoting a military context. It does not matter. This field has been made for temporary and user-dependent coding, and its content can be changed on demand. We'll see later how to resolve technical problems arising from this transiency.

    In fact, on-the-way coding strings are not stored in the main table, but in a special table which mirrors the main table, a mirror table which we call a "dictionary" table. A dictionary table is composed of two fields: a record identifier and a coding field. The record identifier creates a link between the dictionary record it is part of and a record of the main table. Its value is the same as that of the identifier of the object of the main table which the dictionary entry is describing. In fact, the records of the dictionary table have been created beforehand. As soon as a new identifier is used by a new record of the main table, the link with the dictionary entry is automatically created, so that the on-the-way coding strings stored in the dictionary can be directly accessed from the main table, and the other way round.

    Storing on-the-way codes in an independent table, a table that can be easily created and renewed, gives the user perfect liberty in creating, changing and erasing on-the-way coding strings. Whatever he does in the dictionary does not affect the main table. Moreover, if only one dictionary of any given kind may be linked to a same data table at a given time, various versions of a same dictionary may be prepared beforehand and stored unlinked, ready to be connected on demand. To link an existing dictionary to an existing main table, it suffices to give the dictionary table the name under which the link has been declared among the parameters of the main table. To unlink it, it suffices to change this name to any other. So various dictionaries can be in existence at the same time, each one belonging to a different researcher and collecting his own specific on-the-way coding strings. They can alternately be linked to the main table, and in such a way alternately activate different sets of on-the-way codes15.

    15 Although for now only one dictionary can be linked to a main table at any given time, a further development will probably allow users to put into function, at the same time, as many dictionaries as users, independently from one another. This is technically feasible.
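    As an illustration only, a minimal Python sketch of a dictionary table as a mirror of the main table, one entry per record identifier, holding nothing but user-dependent codes; swapping dictionaries swaps interpretations without touching the data. All names and values are invented for the example.

        main_table = {
            "R0001": {"actor": "A0001", "action": "colonel of the Guards", "date": "1745"},
            "R0002": {"actor": "A0002", "action": "bishop of Cuenca", "date": "1760"},
        }

        # Two alternative dictionaries, mirroring the main table record by record.
        dictionary_v1 = {"R0001": "Military", "R0002": ""}   # one researcher's markers
        dictionary_v2 = {"R0001": "", "R0002": "ObEsp"}      # another researcher's markers

        active_dictionary = dictionary_v1   # "linking" a dictionary, here by simple assignment

        for record_id, record in main_table.items():
            print(record_id, record["action"], active_dictionary.get(record_id, ""))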


    Insert XVIII. Entries of the actors' dictionary:

    From left to right: first column, record identifier (linking field with the main table); second column, actor's name (called by the link from the main table); third column, gender (manually set); fourth column, on-the-way coding strings: "xxx", empty; "Milgrad", the actor was, at some moment, a military serviceman; "ObEsp", the actor was, at some moment, a Spanish bishop; "Curaparro", the actor was, at some moment, a parish vicar; "Canofi", the actor was, at some moment, an "office canon" in a cathedral; "Ultram", the actor served at some moment in the Indies, etc.

    Using dictionaries resolves another problem, derived from the fact that different actions have been stored in different and independent records. This way of doing things makes it fairly easy to retrieve actions carried on by different actors, either identical actions or actions belonging to the same class, whether by means of the permanent coding string or by means of the on-the-way code. However, it makes it difficult to retrieve actors. The actor does not exist by himself. He is a collection of actions. So he is spread, so to speak, over various records; some of which store the character or characters used as criteria for the query, and thus can be directly accessed by the query, some others not, and so cannot be directly accessed. To make the retrieval of actors possible, Fichoz comprises a dictionary table linked not to actions, but to actors. The linking field of the main table is not the record identifier, but the actor's identifier. In that way, the content of the Actors' dictionary entry referring to a same actor can be displayed in a linked dictionary field along with every Actions table record referring to that actor.

    Insert XIX. Actions entries of a same actor, all of them marked with the same on-the-way coding strings

    Id    Actor's name    Action    Date    Actors' dictionary coding string

    To retrieve the complete life courses of a set of actors characterised by any given characteristic, users must first select all entries of the main table which match the criteria of the research, then earmark the corresponding Actors' dictionary entries with an on-the-way coding string. This being done, this coding string will automatically be displayed in a linked field from the Actors' dictionary along with every one of the main table entries referring to the concerned actor. A query based on the linked field will display all the records of the Actions table linked to any of the actors characterised by the queried characteristic. Sorting them by actors' identifiers and by dates will order the retrieved records in such a way that they can be read as actors' life courses. The same operation can be repeated to select actors characterized by various criteria. The corresponding markers can be appended one by one to the on-the-way string and queried one by one or together from the same16.


    16 A similar result could be reached through the JOIN command of SQL. But planning a database for SQL would introduce a degree of rigidity and opacity we consider incompatible with scientific research.
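    As an illustration only, a minimal Python sketch of the life-course retrieval just described: select the actions matching a criterion, earmark the corresponding actors in the Actors' dictionary with an on-the-way code, then pull every action of the marked actors and sort by actor and date. The structures, codes and values are invented for the example.

        actions = [
            {"actor": "A0001", "action": "consejero de Castilla", "date": "1746",
             "code": "FFEA-AKAAD-CAxx-xx"},
            {"actor": "A0001", "action": "birth", "date": "1701", "code": ""},
            {"actor": "A0002", "action": "oidor de Valencia", "date": "1725", "code": ""},
        ]
        actors_dictionary = {"A0001": "", "A0002": ""}

        # 1) Select the actions matching the research criterion and earmark their actors.
        for act in actions:
            if act["code"].startswith("FFEA-AK"):
                actors_dictionary[act["actor"]] += " ConsCast"

        # 2) Retrieve every action of the marked actors and sort them into life courses.
        life_courses = sorted(
            (act for act in actions if "ConsCast" in actors_dictionary[act["actor"]]),
            key=lambda a: (a["actor"], a["date"]),
        )
        for act in life_courses:
            print(act["actor"], act["date"], act["action"])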


    b) Using a database. Guidelines and implications

    1) Quantitative vs qualitative?

    Many social scientists are still convinced that databases serve fundamentally for quantitative studies; that computers were made to compute digits, and nothing more. There was some truth in this assertion thirty years ago, when machines had so little memory space that everything had to be compressed and abbreviated to avoid constant overflows: digits used far less memory than letters, and were consequently easier to handle with an electronic device.

    Insert XX. Memories of ancient times:

    My first personal computer, a Commodore I bought for Christmas 1981 at the price of a small car, had 36 KB of RAM. I also bought an extension to 44 KB of RAM, which was a sybaritic luxury. From 1978 on, Michel Demonet, the engineer and colleague to whom I owe all I know about computing, processed for me 8,500 trials of the Spanish Inquisition using only 252 KB of RAM on the CNRS mainframe at Orsay. A multi-factorial analysis, which was absolutely central to my work, demanded more. The kilobyte per second was so expensive that we had to do it on a Saturday, the 14th of August, in the afternoon, the 15th being a major holiday in France, with everybody on holiday, so that what the Orsay Centre charged was at its lowest. We were alone on France's most powerful civil mainframe, along with a team from Grenoble who were running a gigantic (for those times...) program of atmospheric modelling. We saved a lot.

No technical argument of this kind can be produced nowadays. Nevertheless we are still confronted with remarks of this sort, even from colleagues who massively use their computer for text processing, which is not precisely a quantitative process. The reason probably lies in an erroneous concept of what we call transcoding. As we said before, quantitative sources have usually been elaborated with a view to transmitting pre-digested information to third parties not directly involved in the described actions. Such documents can usually be loaded into the computer as they stand, without extra transcoding: they have been naturally coded by their authors, even before the computing era, in a way which makes them compatible with computer standards. Users may in that way skip the most difficult and most time-consuming stage of the computing process, a fact which makes quantitative data look more computer-friendly than other sources. This is a mere delusion, due to a more advanced state of preparedness of the data. So much for this legend.

Computers are not specially good at digits. They are equally good at every kind of adequately coded data. They manage alphanumeric data as well as digits once these have been transcoded, split into homogeneous blocks and marked with the necessary metadata to be handled by the machine. Computers deal with such data practically without limits. Bulk, nowadays, does not matter at the scale we social scientists are working at, as long as the database has been correctly structured. I personally handle, with a supermarket-class personal computer, a database of more than half a million records holding data on some 70,000 persons, with an instant response to almost any query I can think of, "instant" meaning something between one second (the most usual case) and five. Being able to process a high volume of information is a fundamental prerequisite for a scientific database, but it is only half the story.

The most interesting part of it lies in the fact that recasting information into data formatted after a unique pattern or, better said, as we shall see, after a limited number of linked patterns, makes it possible to access simultaneously, with equal ease and a similar lack of friction, information sets obtained from a variety of different sources. The operation of transcoding projects all the information onto a same plane or, better said again, onto a limited number of planes, connected by links which make it possible to cross from one set to another or, if necessary, to view them all as a whole, to access them all as a single and unique space. We shall see further on, through examples drawn from Fichoz, how things work. We shall just stress here some positive consequences for research of this reduction of a composite set of information to a same plane.

Prosopography, for instance, changes in nature. Traditionally, a prosopographical study selected a limited number of individuals and described them along a limited series of dimensions17. Even if the researcher grew aware during the research that the criteria of his first selection were inaccurate, he had no choice but to go on, because any change would have meant a fresh start, while research programs cannot be extended indefinitely over time18. Conversely, a prosopographical study based on a database of life courses obviously starts from a predefined set of individuals, but is not constrained by it.
Better still if the database is a collective venture, such as Fichoz, a feature which computers and the internet nowadays make possible, and stores data contributed by various authors. The bulk of the database and the variety of contributors involved put at the researcher's disposal a huge quantity of data on persons who do not belong to the studied group, but who may be related to it in various ways. When finding a new name, the researcher is able to locate and identify the concerned actor. He is able to follow up clues, strings of chained actors. He is able to call up, about each of them, genealogical and family data, and a huge variety of miscellaneous information, so as to obtain an adequate profile of the actor. All this could be done manually, of course; but it would mean a life's work, whereas a collective database makes it possible in a matter of hours, sometimes of minutes. A huge variety of packages for

17 See Fayard (Janine), Los miembros del Consejo de Castilla (1621-1746), Madrid [Genève], Siglo XXI [Droz], 1982 [1979], trad. esp., 580 p., a book considered one of the most accomplished prosopographical studies ever written.

18 Lambert (David), Notables des colonies. Une élite de circonstance en Tunisie et au Maroc (1881-1939), Rennes, Presses Universitaires de Rennes, 2009, 367 p., is a clear example of this kind of problem.


analysing data are now at the researcher's disposal, from the most classical spreadsheets such as Excel® or OpenOffice®, up to more sophisticated tools such as Pajek® for network analysis, Orange Canvas® for statistical analysis, Atlas.ti® for marking texts with beacons and for conceptual analysis, or ArcGIS® and Cartes & Données® for cartography19. Data can be transferred to any of them from the database in a matter of minutes, or seconds. The researcher is now able to test any hypothesis, instantly if the necessary data have already been loaded, in some days if not. This changes everything. It is not only a question of doing things faster: speed makes them possible at all. The researcher recovers a capacity of really exploring, really investigating the world he is studying, with no more limits than his own creativity. Computing makes "qualitative" studies possible.

Not as they used to be: reading sources, catching one idea, one possible working hypothesis, "proving" it by demonstrating that it made sense of the sources, and declaring oneself happy with it without contrasting it with other possible theories, for sheer lack of time. But as they ought to be: by testing conflicting hypotheses over huge quantities of data which have not been previously selected, which have not even been chosen and loaded into the system by the concerned researcher, and by demonstrating that one of these hypotheses gives a better account of these data. Proceeding in such a way would be impossible without the database. A true database. Accumulating data in a variety of conflicting formats, or in mere text form, is obviously better than handling them on paper sheets: clicking a trigger is faster than getting a book from a shelf, consulting indexes and finding pages. But it does not mean any qualitative change: the researcher's mind is still the place where bits of information from different sources must be brought together, compared and combined, the only place in which they hold together. A true database keeps data interconnected with one another outside the researcher's mind. The computer takes over the mechanical, material and repetitive tasks. The researcher can devote his whole intellectual capacity to the most creative part of his job.

2) Ergonomics

This, of course, has a price. First of all, time. Computing helps to do things the researcher could not do before. But it does not properly save time, except under particular circumstances. Loading data is in most cases a time-consuming process which cannot be left in non-expert hands. By "experts", we mean experts in the field the data belong to, not experts in computing matters. We remind the reader that, with the possible exception of purely statistical data, sources must be understood, decoded and recoded, which demands a perfect understanding of the matter in question and a capacity to detect unadvertised changes in the sources. Unless sources are highly standardised documents, previously examined by a supervisor to make sure that they do not contain unadvertised traps, only post-graduate students and above are able to do the job. Creating a database is not a cheap process. The counterpart is that, once this work has been done, it has been done for good and for ever, and nobody will ever need to do it again, at least as long as the basic ideas on which our present scientific practice is based hold. Moreover, computerized data can be placed at the community's disposal. Thus, even if no direct individual gain can seriously be expected, a global benefit is obvious.
But the researcher is a member of the collectivity, and as such has a share in this collective benefit, derived from the fact that his own data are ipso facto embedded into a far larger set of similar data, a setting in context which increases their value. A second set of benefits is associated with the fact that data, once loaded into the database, can almost immediately and directly be processed. From this point of view they are quite different from full-text notes, which had to be read beforehand (better said, re-read), in which relevant parts had to be selected and copied, usually manually coded or transcribed before use. In a true database, all that has already been done. Globally, nevertheless, little time is to be gained, and the researcher must be prepared to spend much of it on what many historians used in past times to consider a subordinate task: transcribing and transcoding sources.

This reassessment of source work I rather consider a benefit in itself. I always thought, even in pre-computer times, that it was a fundamental and too-neglected part of the research process, on which everything else depended. The same positive conclusion may be drawn from a close examination of the second possible drawback induced by the computer: complexity. Transcoding data into a limited set of tables makes it possible to handle together huge sets composed of data of every description. For instance, a database on the painted scripta (painted inscriptions) of Roman Pompei20 displays at the same time transcriptions and translations of the scriptum, a complex set of location indicators, a set of descriptors freely chosen by the researcher defining the characters of the scriptum, a set of records describing the actions carried out by the actors mentioned in it, and the comments provided by current historiography about the text and the mentioned actors. All that on the same screen, with the possibility of accessing data drawn from every table (Insert XXI). It obviously does not look like the simplest way of doing things, and many analysts would disagree with such a proceeding. I refer them to Appendix I for a theoretical answer. I shall just underscore here the fact that the researcher needs all this information to characterize the scriptum for research purposes, and that the demands of the researcher must be the programmer's supreme law. Complexity does not come from the computer side of the business; the computer only brings to the forefront the latent complexity of the data and makes it possible to manage it.

19 Further information can easily be retrieved from the homepage of every one of these packages on the Internet.

    20 Courrier (Cyrille), Dedieu (Jean Pierre), "Pompei scripta. A database for the study of Roman painted inscriptions", in: Nulla dies sine littera. Instrumenta inscripta, to be published.


Insert XXI. An example of complexity: the global layout of the Pompei database's Scriptum section.

The layout displays a global view of all the pieces of information related to the scriptum. It allows the user to take into account not only the characters proper to the scriptum itself, but also those of its surroundings. A set of standard colours helps him to cope with the inherent complexity of the data.

Database building must obviously take this complexity into account. First of all, the database must be visually perfect: field borders must adjust exactly (one pixel more or less is enough to create a visual recess which attracts the user's eye in a disorderly way), and defined sets of colours must be systematically used so that the category to which each part of the screen belongs is known at first sight. Abbreviations must be banned. Capital letters must be used according to the usual practice of the language of the database, etc. Such visual elements are fundamental because the researcher's eye is part of the chain along which information runs. Anything that slows down viewing and understanding makes the whole information system slower.

A complex database makes a great variety of queries possible. We have already insisted, from this point of view, on the importance of a user-friendly package. Some of these queries are more usual than others; these must be pre-programmed, and a set of triggers placed at the user's disposal on the screen. To make it easier to manage the multifarious combinations of data which arise from the complexity of the database, a variety of pre-programmed screen layouts must be provided, and a margin must be left to create more on demand.


Insert XXII. An example of a complex layout: a fragment of the main layout of the Actions table of the Actoz database.

Each field has been given a different colour, depending on its content. This colour is repeated throughout the whole database for fields with the same content. A set of triggers makes it possible to launch frequently used routines. The colour of a trigger denotes its function: all blue triggers in the whole database, for instance, retrieve and display all the records which contain, in the neighbouring field, the same content as the current record. A trigger is a small screen area programmed to launch an automatic routine (a "script") and carry out a task without the user's intervention.

Ease and speed in creating and changing screen layouts must thus be considered a basic factor in the selection of a database package.

c) Package

Fichoz presently runs on FileMaker®21. We chose this package for the following reasons:

a) Availability. A database is a long-term endeavour, ideally a perennial one. Packages, in themselves, are not perennial.

Insert XXIII. Every time better.
We have personally used three generations of database packages run on personal computers as applications (apps) since 1981: 1) the non-relational generation, with dear old DBase® and some others, like Texto®, the one we began Fichoz with, which was absolutely brilliant but not exactly ergonomic; 2) the first relational generation, with Access®, still in use but fading away by now, a compromise between various strategies which never really worked, or 4D®, another brilliant program but too complex and too strictly limited to Macintosh® surroundings to become a standard; 3) FileMaker®, which is at the same time relational, totally indexed, and wholly based on WIMP technology (see above).

Consequently, and apart from any other consideration, we discard home-made packages, which strictly depend on the person who built them. We also discard commercial or open-source packages which are not circulated widely enough to warrant good chances of long-term upkeep and the probable existence of compatible successors able to recover the stored data should they disappear.

b) Flexibility. Flexibility is a most essential characteristic when the database package is used for research. Flexibility in formatting layouts, as we saw above; but also flexibility in allowing easy changes to the database structure: changing field names, adding new fields, suppressing unused ones, inserting new tables without breaking down the database, copying programs from one file to another without having to write them anew; all these are basic requirements. The package must be flexible enough to allow handling the data in the most natural possible way.

c) Power. A very simple requirement: fully indexed fields. They alone make speed and flexibility compatible. They alone allow writing data using the same words as the source, in the order of the source, without slowing down queries. They alone make it possible to store long character strings in a same field without losing the capacity to retrieve them. A powerful and simple programming language, which allows an easy construction of automatic routines. And, of course, an unlimited capacity as to the number of records (let us remember that a scientific database must be planned as a never-ending object) and as to their length.

d) Simplicity. Queries in natural language, without, in most cases, syntactic signs to be appended; a clear-cut distinction between the user's and the administrator's areas; full WIMP technology. An undergraduate who knows no more than pressing keyboard keys must be able to load data after a couple of hours. Making this possible in Fichoz's case, in spite of the intrinsic complexity of the database, was a weighty point in favour of FileMaker. I insist on the fact that these features are essential to the management of the database. Apart from FileMaker®, I simply do not know of any other application with identical characteristics. The main weakness of FileMaker®, from my point of view, is that it does not currently manage SQL queries, which makes network access to the database difficult. Better said, a local network of up to six clients can be very simply implemented with FileMaker®.

21 Complete information, blog, sale conditions, examples, trial version and so on are to be found at the FileMaker portal, http://www.filemaker.fr (French version) or http://www.filemaker.com (English version).


An external network can easily be implemented for up to some 280 clients with FileMaker Server®, but a FileMaker package must be installed on every client's computer. Such limitations make it possible for a scientific team to work together, but prevent any real open web publication22.

The choice of an application is a most important point, but not the main one. More important still is the fact that the database must be structured in such a way as to be easily transferable to any other package of the same characteristics. Fichoz makes intensive use of the facilities provided by the unlimited length of fields and, most of all, by the speed of the queries. Its structure can nevertheless easily be copied to any other relational database package. A lot of programming has been done. It cannot be transferred. But the database can very well work without it: the programming makes things easier and faster, but it does not do anything essential, nor is it in any way necessary to the essential working of the database. I should make of this observation a general point. Passing programs from one application to another is all but impossible; so that, to be transferable from one package to another, a database must work, in its essential parts and functions, without specific programming. This point is no trivial matter. If a database is virtually eternal, it will necessarily have to be transferred in such a way some day.

We have expounded in this chapter our views on what a database must be. We drew them fundamentally from our experience with Fichoz. Fichoz has been planned for social history. The problems we formulate and the conclusions we draw must be understood as bearing first and foremost on social history research and sources. Nevertheless, they can be extended, with minor changes, to other fields of knowledge. Principles like the atomization of the data, a correct adequacy between database structure, data and the purpose pursued by the database, or the necessity of good ergonomics, look fundamental enough to be considered as rules of general application in this field.

So far for generic considerations. We shall now have a look at Fichoz to see how such principles have been implemented.

22 We are presently working with the engineer Gérald Foliot, of the CNRS, on routines which would make it possible to automatically transcribe FileMaker files into SQL tables.


II. Table structure

a) A list of tables

Fichoz consists of a set of related tables. Each of these tables has been planned to process a class of data. A Fichoz database is not necessarily composed of all of them: it is a combination of selected tables which, put together, make it possible to process a given set of data. With few exceptions, each of these tables is stored in a different file. We found storing various tables in a same file quite efficient for short-range programming, but also highly cumbersome when composing special sets from which we planned to leave some tables out.

1) Primary data tables

They hold primary data. Strictly speaking, Fichoz could be reduced to them. They are linked to one another by a variety of links. Each one has been structured and optimized to store one class of data.

Insert XXIV. A list of Fichoz primary data tables (situation on the 1st of November 2011)23:
Name | Content | Class | Use
Actions | Actors' actions | Primary data; a list of actions and binomial relations between actors | General
Array_D2 | Statistical array | Primary data; data to be presented in two-dimensional array form | Special
Cargo | Cargo carried by ships | Primary data; a list of cargo items | Navigocorpus
Characters | Characters of the object | Primary data; an open list of characters describing the object | Pompei
Comments | Historiographical comments | Primary data; historiographical comments extant on the described object | Pompei
Documents | Documents | Primary data; content of legal and private documents used in the DB | General
Genealogy | Genealogical data | Primary data; family relationships | General
Geography | Description of places | Primary data; a list of actions related to geographical places24 | Special
Locations | Location of objects | Primary data; location parameters of archaeological items | Pompei
Pictures | Graphic documents | Primary data; all kinds of graphic documents | Pompei
Points | Ships' routes | Primary data; a description of the journeys made by ships | Navigocorpus
Scripta | Text of inscription | Primary data; the text of the inscriptions stored in the database | Pompei
Sources | Description of sources | Primary data; a description of sources; a description of cultural objects | General
Taxes | Taxes paid | Primary data; a list of all tax items paid by the ship | Navigocorpus

2) Mirror tables

We explained above the role of the "Dictionary" files. They are fundamentally a help to handle primary data files and to manage on-the-way coding strings. A mirror file matches every primary data file.

Insert XXV. A list of Fichoz mirror tables (situation on the 1st of November 2011):
Name | Content | Class | Use
Dictionary_actions | Dictionary of actions | A mirror of Actions; a list of the actions related by Actions | General
Dictionary_actors | Dictionary of actors | A mirror of Actions; a list of actors mentioned in Actions | General
Dictionary_characters | Dictionary of characters | A mirror of Characters; a list of characters mentioned in Characters | Special
Dictionary_comments | Dictionary of comments | A mirror of Comments; a list of comments mentioned in the database | Special
Dictionary_documents | Dictionary of documents | A mirror of Documents | General
Dictionary_encounters | Ship encounters | Primary data; a list of encounters mentioned by ships studied in the DB | Navigocorpus
Dictionary_genealogy | Dictionary of Genealogy | A mirror of Genealogy; a list of actors mentioned in Genealogy | Special
Dictionary_geography | Dictionary of places | A mirror of Geography; a list of places mentioned in Geography | Special
Dictionary_items_cargo | Dictionary of cargo | A mirror of Cargo; a list of cargo items | Navigocorpus
Dictionary_locations | Dictionary of locations | A mirror of Locations; a list of locations mentioned in Locations | Pompei
Dictionary_points | Dictionary of points | A mirror of Points; a list of all the points composing journeys | Navigocorpus
Dictionary_routes | Dictionary of journeys | A mirror of Points; a list of points grouped into route sequences | Navigocorpus
Dictionary_scripta | Dictionary of inscriptions | A mirror of Scripta; a list of inscriptions stored in the database | Pompei

23 Meaning of the column "Use": "General" means that the table must be part of every possible Fichoz set; "Special" means that the table is not necessary in every Fichoz set; "Pompei" and "Navigocorpus" are the names of specific Fichoz sets; the mentioned table is part of those and of no other.

24 Planned to be the basis of a dynamic historical atlas. Geography processes the relationship between an administrative district and a place which belongs to it in the same way as a relationship between two actors.


3) Context tables

They provide extra data which help to set primary data in their proper historical context.

Insert XXVI. A list of Fichoz context tables (situation on the 1st of November 2011):
Name | Content | Class | Use
Dictionary_commodities | Dictionary of cargo | A mirror of Cargo; a list of commodities mentioned in Cargo | Navigocorpus
Dictionary_measures | Dictionary of measures | A mirror of Cargo; a list of measures mentioned in Cargo | Navigocorpus
Diem | Context data | A dictionary of historical concepts used in the current database | General
Diemchro | Context data | A chronology of events related to the content of the current database | General
Geo_general | Gazetteer | A list of geolocated points | General

4) Technical files

They provide tools for maintenance and servicing of the system.

Insert XXVII. A list of Fichoz technical tables (situation on the 1st of November 2011):
Name | Content | Class | Use
Duplicate | Duplication tool | Program; creates an empty copy of the current database | General
Export_points | Cartographic export tool | An empty file prepared to receive exports from Geo_general | General
Export_to_Pajek | Pajek export tool | An empty file prepared to make data directly readable by Pajek | General
Help | Help entries | Metadata | General
Import | Empty structure | Container for any other Fichoz database to be merged into the current one | General
Pajek_relationships | Pajek export tool | An empty file prepared to reformat data after Pajek's requirements | General
Pajek_vertices_dictionary | Pajek export tool | An empty file prepared to reformat data after Pajek's requirements | General

5) Composing specific sets

Files/tables selected for use in a special set generally preserve their standard names, adding a prefix indicative of the current set.

Insert XXVIII. Pompei, a database on electoral propaganda inscriptions found in Pompei:
Primary data files: Pompei_actions, Pompei_array_D2, Pompei_characters, Pompei_comments, Pompei_documents, Pompei_locations, Pompei_pictures, Pompei_scripta, Pompei_sources

Dictionary files: Pompei_dictionary_actors, Pompei_dictionary_characters, Pompei_dictionary_comments, Pompei_dictionary_documents, Pompei_dictionary_locations, Pompei_dictionary_scripta

Context files: Pompei_dima (the same as Diem), Pompei_dimachro (the same as Diemchro)
Technical files: Pompei_duplicate, Help, Geo_general, Export_points, Pajek_relationships, Pajek_vertices_dictionary, Export_to_Pajek

All files belonging to a same database must be kept within a same folder. The name of the folder may be freely altered by the user. Not so the names of the files, which must be strictly preserved: any change would break existing links.

We shall now describe the various sets of Fichoz files (primary data, mirror, context and technical) one by one. This section is still incomplete. We almost fully leave aside, for the time being, all specific Navigocorpus tables. We also leave aside, for now, the description of the Mirror and Technical tables. The few points we touch upon will nevertheless be enough to give the reader a clearer idea of the way Fichoz does things.


b) Classes: primary data tables

1) The "Actions" group

The "Actions" main file contains chronological data on actors, corporations and cultural objects. Every action performed or undergone by each actor forms a record in the file. Each record is fundamentally composed of five fields, as explained before, following the 5W pattern: Who? What? Where? When? With whom?

Relationships between actors are stored as pairs: each binomial relationship forms an independent record, which contains the name (and identifier) of both actors, one of them in the "Who?" field, the other in the "With whom?" field. The "Actions" file is the backbone of the system: it exists in every special set of the system.
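As a purely illustrative sketch, the 5W pattern and the storing of binomial relationships as pairs might be rendered in SQL terms as follows; the column names and the sample values are assumptions made up for the example, not Fichoz's own field names.

    -- Sketch of a 5W Actions table (column names are illustrative assumptions).
    CREATE TABLE actions (
        record_id      INTEGER PRIMARY KEY,
        who_id         TEXT,   -- "Who?"       actor identifier
        who_name       TEXT,
        what           TEXT,   -- "What?"      the action itself
        where_place    TEXT,   -- "Where?"
        when_date      TEXT,   -- "When?"
        with_whom_id   TEXT,   -- "With whom?" second actor of a binomial relationship
        with_whom_name TEXT,
        document_id    TEXT    -- link to the Documents table (described below)
    );

    -- A plain action: one actor, no second party.
    INSERT INTO actions (who_id, who_name, what, where_place, when_date)
    VALUES ('A001', 'Mainard, John', 'Medical doctor', 'Dover', '1873');

    -- A binomial relationship: one record, both actors named.
    INSERT INTO actions (who_id, who_name, what, when_date, with_whom_id, with_whom_name)
    VALUES ('A001', 'Mainard, John', 'Employer of', '1873', 'A003', 'Osborne, Mary');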

    The "Actions" files may be complemented, in some versions of the system, by auxiliary files. Some classes of actions, in fact, although they match the general definition of the actions class (moves made by actors) cannot be practically cut to a simple 5W pattern. Better said, the division is theoretically possible, but the results would be difficult to handle practically. In such cases, we chose to create specific auxiliary actions files, the core of which is made of a 5W pattern, but in which data are displayed in a way fully adapted to their nature and to the user's needs. These tables are linked to the main "Actions" table, and the results taken into account in that way. Insert XXIV mentions some auxiliary tables of this kind, such as Pompei_scripta or Pompei_comments. Inserts XXIX and XXX explain the need and structure of other two such tables in different contexts.

Insert XXIX. Auxiliary actions file: Navigocorpus_points.
Navigocorpus is a database of ship travels. It is based on port registers. Each entry of a register describes a travel as a set of chronologically ordered points which the ship sails through. The number of points mentioned by a documentary entry varies from one (e.g. such a ship has been sailing around such a point for fishing purposes) up to an unlimited number (the maximum found so far being 13). A same travel of a same ship is usually mentioned by various registers, one in every port she touched, with slight variations due to the relative position, within the global route of the ship, of the port in which the register was written. Some of those mentions refer to past actions (the ship has already crossed the point), others to future intended actions (the ship is expected to cross the point in the near future). So that if the subject referred to by the question Who? is unique (the ship), the answer to Where? is multifarious. The constitutive rules of the 5W pattern make it necessary to create as many records as there are points mentioned by each of the sources, to mark them with a marker indicative of their future or past character, and with a second one indicative of which record should be selected as the main one among the various records referring to the same point. A set of specific layouts has moreover been written to process a complex set of extra information, the details of which we pass over here. Rather than embedding such features into the Actions layout, we preferred to create a specific table. Apart from processing ship itineraries, this new structure is also able to process any kind of move in space, and to link the net results (a list of unique mentions of all the points describing a same travel) to the Actions table, so as to process them along with other actions.
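A minimal sketch of what such a points table could look like, under assumed names, is given below; the real Navigocorpus structure is richer, as the insert explains.

    -- Illustrative sketch (all names assumed): one record per point crossed by a ship.
    CREATE TABLE navigocorpus_points (
        point_id    INTEGER PRIMARY KEY,
        ship_id     TEXT,      -- "Who?"   : the ship
        travel_id   TEXT,      -- groups all the points of a same travel
        place       TEXT,      -- "Where?" : the point crossed
        point_date  TEXT,      -- "When?"
        time_status TEXT,      -- 'past' or 'intended', as stated by the register
        is_main     INTEGER    -- 1 = record retained as the main mention of this point
    );

    -- Reconstituting a travel: one mention per point, in chronological order.
    SELECT place, point_date, time_status
    FROM navigocorpus_points
    WHERE travel_id = 'T0001' AND is_main = 1
    ORDER BY point_date;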

Insert XXX. Auxiliary actions file: the genealogical table.
A genealogical table describes a set of relationships between actors. These relationships can be reduced to two classes: descent and marriage. Marriage is easy to insert into the Actions table; it is formally a typical "With whom?" question. The problem lies in the intricacy created by the set of interwoven relationships as soon as the number of concerned actors grows above half a dozen. Usual interpersonal relationships rarely stretch beyond the third degree: a friend of my friend is the practical limit I am able to reach. Kinship is far more complex. The son of the brother of the husband of my wife's sister is a relative of mine. If descent and marriage alone are enough to give an account of first-level relationships, they must be combined into long strings for second, third, fourth and further-level relationships. In the example given above, the combination would be (read from right to left):

chCHF, in which F stands for son, H for brother, C for husband, h for sister and c for wife. We decided that, in that case too, it was better: 1) to create a special table for genealogical relationships; 2) to link the actors featured in it to the mentions of the same actors in the Actors' table; 3) to process kinship within the genealogical table and to display the results, when needed, on the main display layout of the Actors' table by means of this link.
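A hedged sketch of how such a genealogical table might be laid out is given below, with assumed names and the letter codes of the example; the chaining of elementary links into strings such as "chCHF" is performed by the database's own routines, which are not shown here.

    -- Illustrative sketch (names and codes assumed): elementary genealogical links,
    -- limited to descent and marriage, from which longer kinship strings are composed.
    CREATE TABLE genealogy (
        link_id   INTEGER PRIMARY KEY,
        ego_id    TEXT,   -- the actor from whom the relationship is read
        alter_id  TEXT,   -- the related actor
        link_code TEXT    -- 'F' son, 'H' brother, 'C' husband, 'h' sister, 'c' wife
    );

    -- The composite relationship of the example, read from right to left, chains
    -- five elementary links:
    --   c (my wife) -> h (her sister) -> C (the sister's husband)
    --   -> H (his brother) -> F (the brother's son)   =>  'chCHF'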

    2) The "Documents" tableFichoz is a historical database. Historical sources use to mention actions within documents. The historian, in fact, works with documents. The fact that


two different actions are mentioned within a same document is, in most cases, an important piece of historical information in itself, which must be preserved when storing the data in the database; especially when the document not only creates a topological relationship between actions by mentioning them together, but also tells a story in which various actions are embedded and linked together in a specific way by this same story.

Insert XXXI. A legal document: a will.
Summary: John Mainard's, medical doctor, will, drawn at Dover (Kent), on May the 1st, 1873.
Document identifier: 01. Document class: Will.
Content: "I, John Mainard, aged 67, general practitioner, living at Mainard's house, Mayflower Street, Dover (Kent), sane of body and mind, wrote these my last wills and dispositions. First, I commend my soul to my Saviour, in whose faith I always lived and protest to die. My body will be buried in the graveyard of Saint Peter's Church in Petersborough, my birthplace, along with the body of my beloved father. I leave a life rent of 100 (one hundred) pounds a year to Mary Osborne, my lifelong servant, as well as the right to live in the keeper's house at the western gate of Mainard's house. I leave the remainder of my properties to my nephew William Stopwith, medical practitioner at Saint George's hospital, in London. This will and last dispositions I sign with my name and subscribe, at Dover, May the first 1873. John Mainard."

Actions:
With whom | Who | What | When | Where | Doc. id.
 | Mainard, John | Medical doctor | 1873 | |


Insert XXXII. Caesar's death.
Document: [Date] March 15th, 44 BC [Content] Caesar's murder by conspiring members of the Roman nobility [Record id.] 00000034
Actions:
[Date] March 15th, 44 BC [P2] Caesar [Text] Death [Document id.] 00000034
[Date] March 15th, 44 BC [P1] Caesar [Relationship] Murderer [P2] Brutus [Document id.] 00000034
[Date] March 15th, 44 BC [P1] Caesar [Relationship] Murderer [P2] Cassius [Document id.] 00000034
...

Caesar's murder is an important coordination point of Roman political history. It is important to be able to follow all the actors involved, before and after it. Fichoz users may attain this result by creating in the Documents table an event "Caesar's death", to which all the records of the Actions table referring to Caesar's murder will be linked. A simple query on the "Document id." field then makes it possible to locate on demand all the actors involved and, if need be, to mark them with a special on-the-way code.
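Re-using the hypothetical 5W actions table and the actors_dictionary table sketched earlier, such a query and the subsequent on-the-way marking might look as follows in SQL terms; the record identifier is the one given in Insert XXXII, everything else is an assumption.

    -- Illustrative sketch: locating every actor involved in the event recorded as
    -- document 00000034, then earmarking them with an on-the-way code in the
    -- actors' dictionary.
    SELECT DISTINCT who_id, who_name, with_whom_id, with_whom_name
    FROM actions
    WHERE document_id = '00000034';

    UPDATE actors_dictionary
    SET otw_coding = otw_coding || ' #CAESAR'
    WHERE actor_id IN (
        SELECT who_id       FROM actions WHERE document_id = '00000034'
        UNION
        SELECT with_whom_id FROM actions WHERE document_id = '00000034'
    );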

A third function of the Documents table consists in creating sets of interlinked documents to tell complex stories. A same story is usually told by various documents: either each one tells a part of the story, or each tells a different version of the same story, from another point of view. It is possible, with Fichoz, to link various