
Linköping University | Department of Computer and Information Science
Bachelor thesis, 16 ECTS | Datateknik
2019 | LIU-IDA/LITH-EX-G--19/074--SE

Comparing database management systems with SQLAlchemy
A quantitative study on database management systems
(Swedish title: Jämförelse av databashanterare med hjälp av SQLAlchemy)

Marcus Fredstam
Gabriel Johansson

Examiner: Anders Fröberg

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se




    Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Marcus Fredstam, Gabriel Johansson


Abstract

It is difficult to know in advance which database management system to use for a project. Luckily, there are tools that let the developer apply the same database design to multiple different database management systems without having to change the code. In this thesis, we investigate the strengths of SQLAlchemy, an SQL toolkit for Python.

We compared SQLite, PostgreSQL and MySQL using SQLAlchemy, and also compared a pure MySQL implementation against the results from SQLAlchemy.

We conclude that, for our database design, PostgreSQL was the best database management system, and that for the average SQL user, SQLAlchemy is an excellent substitute for writing plain SQL.

Acknowledgments

We want to thank Jon Dybeck at the Department of Computer and Information Science at Linköping University for giving us the opportunity to work on the project that this thesis is based on.

We would also like to thank Anders Fröberg at the Department of Computer and Information Science at Linköping University for helping us throughout this thesis by giving us ideas and inspiration when we got stuck.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations

2 Theory
  2.1 Data
  2.2 Database
  2.3 Database Management Systems
  2.4 Database design
  2.5 Object-Relational Mapping
  2.6 SQLAlchemy

3 Method
  3.1 Design
  3.2 Testing

4 Results
  4.1 Time results
  4.2 Code comparison results

5 Discussion
  5.1 Results
  5.2 Method
  5.3 The work in a wider context

6 Conclusion
  6.1 Time test
  6.2 Code comparison
  6.3 Future work

Bibliography


List of Figures

2.1 Example of 1NF of two database schemas.
2.2 Database schema in 2NF using ID as primary key.
2.3 Example of 3NF of two database schemas.
2.4 ER-model of the SQLAlchemy example.

3.1 EER-model of the database design.

4.1 Average insert time.
4.2 Average delete time.
4.3 Average search time.
4.4 Average update time.
4.5 Inserts per second.
4.6 Deletes per second.
4.7 Searches per second.
4.8 Updates per second.


List of Tables

3.1 Hardware and OS used for testing the DBMS.

4.1 Result data from the insert time tests.
4.2 Result data from the delete time tests.
4.3 Result data from the search time tests.
4.4 Result data from the update time tests.
4.5 Result data from the code comparison test.


1 Introduction

There are many database management systems (DBMS) on the market. Each DBMS has a unique set of commands available to the user for interacting with the database. This means that once a database has been implemented with one DBMS, it requires a lot of work to implement the same database with a different DBMS. We believe that it is difficult to know in advance which DBMS to use, which makes the choice a risk when developing applications that use databases. In other words, since it takes a lot of effort to switch DBMS, it is important to carefully consider which DBMS to use.

Luckily, there exist tools that developers can use to help create databases, some of which can be applied to multiple different DBMS. This is a huge advantage for developers who have not decided, or are unsure of, which DBMS they want to use. However, different DBMS may perform differently with the same tool; i.e., DBMS A may perform better than DBMS B (or vice versa) using tool X.

In this thesis, we use the database tool SQLAlchemy for Python and compare the performance of different DBMS.

    1.1 Background

Due to administration difficulties at the Department of Computer and Information Science at Linköping University, the system administrators wanted a tool that would help them keep track of what software was needed by each course in their computer environment, and store that information in a database. The system administrators therefore came up with the idea of running a website that course staff could use to request the software they need for their course during the semester. While working on this project, we were considering which DBMS to use for the website. We eventually found a database tool called SQLAlchemy and used it to perform a study to figure out which DBMS fit our design best.

    1.2 Aim

The purpose of this project is to evaluate different database management systems with SQLAlchemy.



    1.3 Research questions

These are the questions that our project investigates.

1. How well do different database management systems perform compared to each other with SQLAlchemy?

2. How does SQLAlchemy, using MySQL as the database management system, compare to a pure MySQL implementation?

    1.4 Delimitations

There were a couple of self-imposed delimitations during this project, such as only using Python, only using relational databases, and having the tool be compatible only with Linux. These are delimitations we chose ourselves, either for ease of development or because of the requirement specification. We also limited ourselves to using SQLAlchemy exclusively.

Another factor that impacted the project was time. Since our time with this project was limited, we could not run some of the tests at larger volumes, which we otherwise would have wanted.


2 Theory

This chapter discusses the theory of the different topics that are necessary for understanding this thesis. The chapter is divided into the following sections. Section 2.1 discusses the very basics of what data is. Section 2.2 explains what a database is at a high level. Section 2.3 brings up what a database management system is and why it is necessary for managing data. Section 2.4 discusses design steps for making sure the database schema is in a good state. Section 2.5 explains what object-relational mapping is. Section 2.6 covers the Python package SQLAlchemy for creating databases at a higher level.

    2.1 Data

Data can be described as information [1] if it has been given meaning. It should have some value that makes the owner of the information more knowledgeable. Data can be either qualitative or quantitative.

    Qualitative data

Qualitative data is data that cannot be represented by numbers [2]. Instead, qualitative data is observed and defined in some other way. E.g., the colour of a shirt, the mood of your friend, and the name of a piece of software are data that cannot be described with numbers.

    Quantitative data

Quantitative data can, unlike qualitative data, be represented by numbers [2]. Numbers can take different forms depending on what kind of data they represent, e.g., percentages, averages, and time. E.g., how fast a car is travelling, the number of employees at a company, and the temperature in a room are data that can be represented as numbers.

Data is most often collected because the person who collects it wants to become more knowledgeable [1]. When collecting data, the amount can become overwhelming in some cases. Thus, having the data organised in some way is a good idea.



    2.2 Database

A database stores and organizes data electronically [3]. Databases are often used by companies and organizations to manage their vast amounts of data. An example is an airport, which keeps track of flights, travellers, tickets, and planes. The airport can use that information to figure out how many seats on a flight are not booked, how many tickets are sold, and which traveller used which ticket.

There are different kinds of databases, of which the most common are relational databases, object databases, and object-relational databases [3]. Another database type that has become popular in recent years is the NoSQL database [4, 5].

    Relational data model

Based on the relational model proposed by Codd [6], relational databases use one or more tables. Each table has attributes (columns) that are of some data type, and rows that hold the data. Tables can be related to each other, i.e., it is possible to use one table to get data from another table. This makes it easier to find and manage data.

    Object data model

Another database model is the object data model, where the idea is to use the notion of object-orientation to group data. Instead of being stored in tables, the data is stored in objects. The object data model defines objects, their attributes and their functionality; the key factor is that designers can define their own data types, which is not possible in the relational data model [3].

    Object-Relational data model

Over the years, the relational data model has been extended with concepts and functionality from the object data model, making up what is called the object-relational data model [3]. In the object-relational data model, just like in the object data model, designers can define data types, but instead of storing the data in objects, the data is stored in tables, just like in the relational data model.

    NoSQL model

NoSQL stands for "Not only SQL"; NoSQL databases do not use the traditional tables that relational databases use. They are designed for millions of users interacting with the database at the same time [4]. NoSQL databases scale horizontally, i.e., you can add more servers instead of having to upgrade one central server, and the load and data can be distributed across all of the servers. Three popular types of NoSQL databases are key-value stores, column-oriented databases, and document-based stores [5] (not covered here).

To be able to use these models and manage data, one would typically use a database management system.

    2.3 Database Management Systems

A database management system (DBMS) is software that implements database models and thus allows for creating and managing databases [3]. Users interact with the DBMS either through a graphical user interface (GUI) or by running structured query language (SQL) commands. SQL is used to tell the DBMS what to do, such as insert new data, update or delete existing data, and select data for viewing.
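The SQL operations just mentioned can be sketched with Python's built-in sqlite3 module (SQLite is one of the DBMS compared later in this thesis). The table and data below are made up for illustration and are not taken from the thesis:

```python
import sqlite3

# In-memory SQLite database with a made-up "flight" table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE flight (id INTEGER PRIMARY KEY, destination TEXT, seats INTEGER)")

# INSERT new data
cur.execute("INSERT INTO flight (destination, seats) VALUES (?, ?)", ("Stockholm", 180))
cur.execute("INSERT INTO flight (destination, seats) VALUES (?, ?)", ("Copenhagen", 120))

# UPDATE already existing data
cur.execute("UPDATE flight SET seats = 150 WHERE destination = ?", ("Copenhagen",))

# SELECT data for viewing
rows = cur.execute("SELECT destination, seats FROM flight ORDER BY id").fetchall()

# DELETE existing data
cur.execute("DELETE FROM flight WHERE destination = ?", ("Stockholm",))
remaining = cur.execute("SELECT COUNT(*) FROM flight").fetchone()[0]
con.close()
```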



A few advantages of using a DBMS are redundancy control, restriction of unauthorized access, and communication from a distance [3].

Redundancy control makes sure that each logical item is stored only once, sometimes even using denormalization to reduce the number of redundant queries on important data.

Restricting unauthorized access uses the notion of user permissions to make sure that each user can only perform what their access rights allow them to do.

By using a DBMS, it is also possible for multiple users in multiple locations to receive data from the system without knowing the physical location of the database. This means that it is possible to create applications with databases in which users do not have to query a specific location for data, since the application handles requests to the database for them.

    A few well-known DBMS of this kind are MySQL, PostgreSQL, Oracle, and SQLite.

    2.4 Database design

Teorey et al. [7] describe a method for designing a good relational database in three steps. The first step is to create an enhanced entity-relationship model from the requirements of the system. The second step is to transform the model into relations in a preferred database language. The third and final step is to normalize the relations. In the upcoming parts, these steps are discussed in more depth.

    Entity-Relationship and Enhanced Entity-Relationship modeling

The upcoming two sections explain entity-relationship and enhanced entity-relationship modelling, and how they differ from each other.

    ER-modeling

Entity-relationship (ER) modelling is used to represent information about entities, entities' attributes, and their relationships [7, 3]. An entity is something with an independent existence in the real world. While it can be a physical object like a car, a phone, or a computer, it can also be a conceptual existence such as a school, a job, or a country.

The attributes of an entity are used to describe it. For example, an entity "Car" may use attributes like colour, registration number, and weight to describe itself, while an entity "Dog" may use attributes like breed, colour, height, and age.

Relationships describe the connectivities between entity occurrences, such as one-to-one, one-to-many, and many-to-many. Some connections between entities are mandatory while others are not. E.g., the entity "Person" may or may not own an animal, but the "Person" certainly has a name and a birth date.

The ER model is useful when communicating fundamental data and relationship definitions with the end user. However, using the ER model as a conceptual schema representation is difficult because of the flaws of the original modelling constructs [7]. One such example is integration, which requires the use of abstract concepts such as generalization.

Another example is data integrity, which involves null attribute values. This usually involves deciding whether a relationship allows a null attribute, i.e., whether it is mandatory that a person owns an animal or whether it is optional.

The enhanced ER (EER) model solves this by providing such generalization and integrity constraints for entities [3, 7]. Moreover, it is compatible with the ER model.

    EER-modeling

The EER model expands upon the ER model, inheriting the concepts of the ER model while adding concepts like subclass and superclass, specialization and generalization, and category or union types [3]; hence the name "enhanced" ER model.



These concepts are fundamental to the enhanced ER model. Superclass (or supertype) and subclass (or subtype) is a concept concerning relations. For example, for an entity type Employee, there may also be different types of employees such as receptionists, managers, cleaners and so on. These are all subsets of the entity Employee, meaning that every member of these entities is also a member of the Employee entity. Therefore, these entities are called subclasses or subsets of the Employee entity type, which in turn makes the Employee entity a superclass or supertype.

The process of defining a set of subclasses of a superclass is called specialization. Specialization also helps with establishing the specific attributes of each subclass and establishing relations to other entity types or subclasses.

Generalization is the inverse of specialization. Namely, it defines a superclass from a set of entity types. For example, a dog, a house, and a car all have an owner. Therefore, it is possible to create the entity type Owner from this common attribute. Since generalization is the inverse of specialization, it is also possible to say that dog, house, and car are specializations of Owner.

Lastly, the concept of union or category types deals with representing more than one superclass. For example, suppose the superclasses School, Household, and Company exist. Then, for a database containing computer IDs, an owner could be any one of School, Household, or Company. Therefore, it is possible to create a category named Owner, which is a subclass of the union of the superclasses School, Household, and Company.

    Keys and Functional Dependencies

One part of designing a database is determining which attributes can uniquely identify records in the database. Attributes that can identify records are called keys. There are many types of keys, some of which are the candidate key, primary key, superkey, and foreign key [3].

A superkey is a key that can uniquely identify a record and contains one or more of the table's attributes to do so. A candidate key is a superkey with the minimum number of attributes. There can exist multiple candidate keys. Attributes that are part of some candidate key are called prime attributes, and attributes that are not in any candidate key are called non-prime attributes. A primary key is a chosen candidate key that is used to identify unique records in the database.

Sometimes you want to reference another table. This is done by using a foreign key: a key that connects one table to another, establishing a relationship between the tables. The foreign key must reference the other table's primary key.

These keys are not only used to identify records but also appear in functional dependencies. A functional dependency (FD) is a constraint between attributes in database tables [3]; i.e., FDs define relationships within a database. FDs are used to determine the normal form of the database schema. There are six inference rules which, among other things, are used to improve the normal form of the database schema.

    1. Reflexivity: If Y is a subset of X, then X -> Y.

    2. Augmentation: If X -> Y, then XZ -> YZ.

    3. Transitivity: If X -> Y and Y -> Z, then X -> Z.

    4. Union: If X -> Y and X -> Z, then X -> YZ.

    5. Decomposition: If X -> YZ, then X -> Y and X -> Z.

    6. Pseudotransitivity: If X -> Y and WY -> Z, then XW -> Z.
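As an aside not taken from the thesis, these inference rules are what an attribute-closure computation applies in practice: starting from an attribute set X, keep adding every attribute determined by an FD whose left-hand side is already covered. If the closure reaches all attributes, X is a superkey. A minimal sketch, with FDs echoing the graduation example of section 2.4:

```python
def closure(attrs, fds):
    """Closure of the attribute set `attrs` under the functional dependencies `fds`.

    `fds` is a list of (lhs, rhs) pairs of sets of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Transitivity/union in action: if X is in the closure, so is Y.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# FDs echoing figure 2.3a: ID -> {Name, School}, School -> {City}
fds = [({"ID"}, {"Name", "School"}), ({"School"}, {"City"})]

id_closure = closure({"ID"}, fds)          # reaches every attribute, so ID is a superkey
school_closure = closure({"School"}, fds)  # only {"School", "City"}, so School is not
```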



    Normal Forms and Normalisation

To make sure the database schema is of high quality, the schema has to be in at least third normal form, which it reaches by going through normalization. The four most important normal forms for a high-quality schema are the first normal form, the second normal form, the third normal form, and the Boyce-Codd normal form.

    Normal Forms

The first normal form (1NF) states that each attribute in a schema has to be atomic [3, 8], i.e., the attribute cannot be divided into multiple values. E.g., consider a schema containing a single attribute "Person" (see figure 2.1a) which holds data about a person: name, height, weight and hair colour. This schema is not in 1NF, but can effortlessly become so by distributing the attribute into multiple attributes, as seen in figure 2.1b.

    (a) Database schema that is not in 1NF. (b) Database schema that is in 1NF.

    Figure 2.1: Example of 1NF of two database schemas.

The second normal form (2NF) states that the schema must be in 1NF and that every non-prime attribute is fully functionally dependent on the entire primary key [3, 8]. I.e., the primary key must determine the values of all non-prime attributes. E.g., in figure 2.1b (which is in 1NF), the database schema would not be in 2NF if the name were the primary key, since there could be multiple people with the same name. The same goes for all other attributes. By adding a unique identifier to the schema and using that attribute as the primary key, the schema is in 2NF (as seen in figure 2.2).

    Figure 2.2: Database schema in 2NF using ID as primary key.

The third normal form (3NF) states that the schema must be in 2NF and that no non-prime attribute is transitively dependent on the primary key [3, 8]. This means that no non-prime attribute should be able to determine another non-prime attribute. E.g., figure 2.3a shows a schema of people's graduations, where ID is the primary key. The table is in 2NF but not in 3NF, because the non-prime attribute School determines which city the school is in. To bring the schema into 3NF, the table has to be split into two tables, as seen in figure 2.3b.



    (a) Database schema that is not in 3NF. (b) Database schema that is in 3NF.

    Figure 2.3: Example of 3NF of two database schemas.

Boyce-Codd normal form (BCNF) is a stronger version of 3NF. It states that the schema must be in 3NF and that, for every functional dependency X -> Y, X is a superkey.

    Normalization

Normalization is about reducing data redundancy and inconsistency [8, 9]. Data redundancy means having the same data in multiple places in the database. It should be reduced because it makes insertions, deletions and updates difficult. Insertions fill the database with more and more of the same data, which consumes disk space. When deleting records, the problem of accidentally removing information about other things arises: when we delete the last record in the table that holds redundant data, the redundant data is also deleted, i.e., deleting the last record also deletes the only information we have about the redundant data. When updating, all records that hold the redundant data have to be updated; if the table holds millions of records, the update takes longer to complete.

Normalization solves this problem by splitting the table that has redundant data into two tables. The redundant data is then saved in one place, meaning that it is not repeated when inserting new records, it is not accidentally deleted when removing the last record in the other table, and only one record needs to be updated.
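The split can be sketched in plain Python. The table contents below are invented for illustration, loosely based on the graduation example of section 2.4 (School determines City):

```python
# Unnormalised table: the City value is repeated for every person at the same school.
graduation = [
    {"id": 1, "name": "Anna",  "school": "LiU", "city": "Linkoping"},
    {"id": 2, "name": "Bjorn", "school": "LiU", "city": "Linkoping"},
]

# Split into two tables: the School -> City fact is now stored exactly once.
person = [{"id": r["id"], "name": r["name"], "school": r["school"]} for r in graduation]
school = {r["school"]: r["city"] for r in graduation}

# An update now touches a single record instead of every matching row.
school["LiU"] = "Linköping"

# A join reconstructs the original information without the redundancy.
joined = [dict(p, city=school[p["school"]]) for p in person]
```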

    2.5 Object-Relational Mapping

Object-relational mapping (ORM) is a way to convert between rows in a relational database and objects in a programming language [10]. The ORM sits between the programmer's application and the database, and helps the programmer by abstracting the SQL queries used to access, remove and modify the data in the database.
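The core idea can be sketched without any ORM library: convert a fetched row into an instance of a class. A real ORM such as SQLAlchemy adds query generation, sessions, and change tracking on top. This is an illustrative sketch, not code from the thesis:

```python
import sqlite3
from dataclasses import dataclass

# A class corresponding to one table; an ORM would generate the table from it.
@dataclass
class Room:
    id: int
    room_name: str

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE room (id INTEGER PRIMARY KEY, room_name TEXT)")
con.execute("INSERT INTO room (room_name) VALUES (?)", ("room1",))

# The "mapping" step: a returned row tuple becomes an object with named attributes.
row = con.execute("SELECT id, room_name FROM room").fetchone()
room = Room(*row)
con.close()
```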

    2.6 SQLAlchemy

SQLAlchemy is a Python package that helps the programmer by abstracting away the tedious parts of creating databases. Instead of writing SQL queries, the programmer writes Python classes for the design of the tables and functions to access or modify them, and then lets SQLAlchemy take over. SQLAlchemy uses ORM to translate from Python objects to SQL (and vice versa). The programmer can then query the database through SQLAlchemy's built-in functions. An example of SQLAlchemy usage can be seen in listing 2.1, with its ER-diagram in figure 2.4.

SQLAlchemy can be used with a wide variety of DBMS. To be able to connect to and query the database, the programmer has to install a database driver for the DBMS that is used. A database driver is a piece of software that acts as a middleman between the programmer's application and the database. Its job is to convert the SQL queries sent by the application into the underlying format the database uses. These drivers can, just like SQLAlchemy, be installed with pip. SQLAlchemy supports the following DBMS out of the box.



• Firebird¹: open-source relational DBMS developed by the Firebird Project in 2000.

• Microsoft SQL Server²: relational DBMS developed by Microsoft in 1989.

• MySQL³: relational DBMS developed by Oracle Co. in 1995.

• Oracle⁴: multi-model DBMS developed by Oracle Co. in 1979.

• PostgreSQL⁵: open-source object-relational DBMS developed by the PostgreSQL Global Development Group in 1996.

• SQLite⁶: public-domain relational DBMS developed by Dwayne Richard Hipp in 2000.

• Sybase⁷: relational DBMS developed by Sybase in the 1980s.
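With SQLAlchemy, the dialect (and, optionally, the driver) is selected through the connection URL passed to create_engine. The sketch below shows typical URL formats; user, secret, localhost, and mydb are placeholders, and the drivers named in the comments (psycopg2, PyMySQL) are common pip-installable choices rather than ones the thesis specifies:

```python
# SQLAlchemy connection URLs follow the format
# dialect+driver://user:password@host/database
urls = {
    "sqlite": "sqlite:///example.db",  # driver ships with Python itself
    "postgresql": "postgresql+psycopg2://user:secret@localhost/mydb",  # pip install psycopg2-binary
    "mysql": "mysql+pymysql://user:secret@localhost/mydb",  # pip install PyMySQL
}

# With SQLAlchemy installed, an engine would then be created as:
# from sqlalchemy import create_engine
# engine = create_engine(urls["postgresql"])
```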

    Figure 2.4: ER-model of the SQLAlchemy example.

    Listing 2.1: Example of SQLAlchemy usage.

# import sqlalchemy
from datetime import date

from sqlalchemy import Column, Integer, String, Date, ForeignKey
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///:memory:')
Base = declarative_base()
Session = sessionmaker(bind=engine)
session = Session()

# design a table
class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    room_name = Column(String(20), unique=True, nullable=False)

    def __init__(self, name):
        self.room_name = name

class Course(Base):

¹ https://www.firebirdsql.org/, accessed 2018-10-18
² https://www.microsoft.com/en-us/sql-server, accessed 2018-10-18
³ https://www.mysql.com/, accessed 2018-10-18
⁴ https://www.oracle.com/database/index.html, accessed 2018-10-18
⁵ https://www.postgresql.org/, accessed 2018-10-18
⁶ https://www.sqlite.org/index.html, accessed 2018-10-18
⁷ https://www.sap.com/index.html, accessed 2018-10-18

    __tablename__ = 'course'
    id = Column(Integer, primary_key=True)
    course_name = Column(String(50), nullable=False)
    course_id = Column(String(12), unique=True, nullable=False)

    def __init__(self, name, cid):
        self.course_name = name
        self.course_id = cid

class Lecture(Base):
    __tablename__ = 'lecture'
    id = Column(Integer, primary_key=True)
    course = Column(Integer, ForeignKey('course.id'), nullable=False)
    room = Column(Integer, ForeignKey('room.id'), nullable=False)
    date = Column(Date, nullable=False)

    def __init__(self, c, r, d):
        self.course = c
        self.room = r
        self.date = d

# create the tables
Base.metadata.create_all(engine)

# Add new rooms
room1 = Room("room1")
room2 = Room("room2")
room3 = Room("room3")
session.add(room1)
session.add(room2)
session.add(room3)
session.commit()

# Add new courses
course1 = Course("Intro. C++", "ABC123")
course2 = Course("Intro. Python", "ABC484")
session.add(course1)
session.add(course2)
session.commit()

# Add new lectures
lecture1 = Lecture(course1.id, room2.id, date(2018, 10, 15))
lecture2 = Lecture(course1.id, room3.id, date(2018, 10, 18))
lecture3 = Lecture(course2.id, room1.id, date(2018, 11, 5))
session.add(lecture1)
session.add(lecture2)
session.add(lecture3)
session.commit()

# Print the room name of every lecture in the database
for row in session.query(Lecture).all():
    room = session.query(Room).filter(Room.id == row.room).first()
    print(room.room_name)

    10

  • 2.6. SQLAlchemy

    # D e l e t e some o f t h e rowsSess ion . d e l e t e ( l e c t u r e 3 )Sess ion . d e l e t e ( room1 )Sess ion . commit ( )


  • 3 Method

    This chapter explains the method used to answer the research questions "How well do different database management systems perform compared to each other using SQLAlchemy?" and "How does SQLAlchemy, using MySQL as database management system, compare to a pure MySQL implementation?". Section 3.1 discusses design choices for the implementation, and section 3.2 explains the testing procedure for the research questions.

    3.1 Design

    The design of the database is split into a database design and a database interface. These two are connected, as the database interface uses the tables defined in the database design when querying the database.

    In this section, we will discuss the design choices made when implementing the database design and the core functionality of the database interface.

    Database Design

    The database design used for our research questions was based on a requirement specification. This specification was created around the idea that Linköping University wants to keep track of its software, courses, and computer rooms, as well as any connections between them, such as which course uses which software. The general idea behind the database design is therefore the following.

    The database shall store all available software, all courses run by the university, and all rooms that have computers connected to the system. The database shall also store information about which available software has been requested by which course, the people that can administer the courses, and the installed software for each room.

    From this general idea, a database design was created using relational models and normalisation to reach 3NF. The design was then made into an EER-model, which can be seen in figure 3.1. Most of the data in this database is in string format, which is hard to use as an identifier when working with databases, so surrogate integer keys are used instead. The primary keys in the database are named ID, while all attribute names ending with an id are foreign keys, except for course_id in Course. All foreign keys in the design are set to cascade on delete. Therefore, whenever a record in


    Figure 3.1: EER-model of the database design

    a foreign table is deleted, all records with a foreign key to that record's attribute will also be deleted.
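In SQLAlchemy, this cascade behaviour can be declared on the foreign-key column itself. The following is a minimal sketch, not the thesis code; it assumes the newer declarative_base import from sqlalchemy.orm, and the table and column names follow the example in section 2.6:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base
from sqlalchemy.schema import CreateTable

Base = declarative_base()

class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    room_name = Column(String(12), nullable=False)

class Lecture(Base):
    __tablename__ = 'lecture'
    id = Column(Integer, primary_key=True)
    # ondelete='CASCADE': deleting a room also deletes its lectures
    room = Column(Integer,
                  ForeignKey('room.id', ondelete='CASCADE'),
                  nullable=False)

# The generated DDL now contains ON DELETE CASCADE
lecture_ddl = str(CreateTable(Lecture.__table__))
```

With this declaration the database itself, not the application code, removes the dependent rows.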

    Lastly, the database is set so that auto-commit is on. This means that whenever a change is made to any of the tables, it is immediately committed, which in turn means that the database is better suited to quick individual changes than to changes made in batches.

    Database Interface

    The database interface uses the database design to create functions that issue the correct database queries so that the user does not have to write them every time. Therefore, for every table in figure 3.1 there exists an insert function, a delete function, an update function, and a find function.

    The insert function is simple, since it merely has to insert a record into the table. For tables such as Available Software, Room, and Course the insert is straightforward, while for the other tables the insert also has to consider the foreign keys and make sure they are correct before inserting the record.
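As a sketch of what such an insert function can look like for a table with foreign keys, the helper below resolves both keys before inserting. The class and attribute names are illustrative assumptions, not the thesis implementation:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    room_name = Column(String(12), unique=True, nullable=False)

class AvailableSoftware(Base):
    __tablename__ = 'available_software'
    id = Column(Integer, primary_key=True)
    name = Column(String(50), unique=True, nullable=False)

class InstalledSoftware(Base):
    __tablename__ = 'installed_software'
    id = Column(Integer, primary_key=True)
    room_id = Column(Integer, ForeignKey('room.id'), nullable=False)
    software_id = Column(Integer, ForeignKey('available_software.id'),
                         nullable=False)

def insert_installed_software(session, room_name, software_name):
    # Resolve both foreign keys before inserting the record
    room = session.query(Room).filter(
        Room.room_name == room_name).first()
    software = session.query(AvailableSoftware).filter(
        AvailableSoftware.name == software_name).first()
    if room is None or software is None:
        return False  # refuse to insert with unresolved foreign keys
    session.add(InstalledSoftware(room_id=room.id, software_id=software.id))
    session.commit()
    return True

# In-memory database for the sketch
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Room(room_name='room1'))
session.add(AvailableSoftware(name='gcc'))
session.commit()
```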

    The delete function has to consider a few key points when deleting a record from a table. For example, Installed Software has foreign keys to both the Room and Available Software tables. This means that if a record in Available Software is referenced from Installed Software and the delete function tries to delete that software from Available Software, it first has to delete all records with that software from Installed Software before deleting the record in Available Software. Otherwise, constraint errors occur because one or more foreign keys in Installed Software have lost their source. This usually makes deleting records a pretty complicated situation, as several queries have to be made. However, since we use cascade on delete in our database design, we can delete records in all tables without having to worry about these constraint errors.


    Table 3.1: Hardware and OS used for testing the DBMS.

    Component  Model
    CPU        Intel Core i3 2310M @ 2.1 GHz
    RAM        4 GB Samsung SODIMM DDR3 @ 1333 MHz
    SSD        Samsung EVO 840 250 GB
    OS         Ubuntu 18.04

    The find function merely searches for specific attribute values in a table and returns the results. In Available Software, or any other table without foreign keys, nothing complex is necessary. However, in a table like Installed Software, depending on the input, it first has to figure out the proper ids for the targeted software and room before performing the search.

    The update function first searches for a targeted record and then changes some of its values. The first part works just like the find function, while the second part merely changes the values before committing the changes to the database.
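A sketch of such an update function, using an illustrative Room table in the style of the example in section 2.6 (names are assumptions, not the thesis code):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    room_name = Column(String(12), nullable=False)

def update_room_name(session, old_name, new_name):
    # First part: find the targeted record, just like the find function
    room = session.query(Room).filter(Room.room_name == old_name).first()
    if room is None:
        return False
    # Second part: change the value and commit the change
    room.room_name = new_name
    session.commit()
    return True

# In-memory database for the sketch
engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Room(room_name='room1'))
session.commit()
```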

    3.2 Testing

    This section discusses the testing performed to answer the research questions. The database design and interface were implemented both in Python using SQLAlchemy and in SQL for MySQL. For the first research question, we chose MySQL, PostgreSQL and SQLite as DBMSs, because they are available and compatible with SQLAlchemy. For the second research question, we went with MySQL because we had previous experience with it. The hardware and operating system used for testing can be seen in table 3.1.

    Data generation

    To be able to perform tests, data for filling the database had to be generated. We created a script for generating data that accepted an integer N for how many records to generate. The tables in the database have relatively many string attributes; we therefore decided that each string attribute should be filled to its maximum length. The generated strings used a mix of upper-case and lower-case characters, as well as digits. After the script had generated N records, it wrote them to a file on disk in JSON format. Listing 3.1 shows an example of how data was generated for one of the tables.

    Listing 3.1: Data generating example

    def generate_room_data(N):
        room_names = []
        for _ in range(N):
            room_name = random_text_generator(length=12)
            room_names.append(room_name)

        data = {"rooms": room_names}
        filename = str(N) + "_room.json"
        write_json_data_to_file(filename, data)
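The helper random_text_generator is not shown in the thesis; the following is a plausible sketch, given the description above that each string attribute is filled to its maximum length with upper-case and lower-case characters and digits:

```python
import random
import string

def random_text_generator(length=12):
    # Mix of upper case, lower case and digits, as described above
    alphabet = string.ascii_letters + string.digits
    return ''.join(random.choice(alphabet) for _ in range(length))

sample = random_text_generator(length=12)
```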

    Time test

    For both research questions, we ran time tests for inserts, deletes, updates and searches using Python's time library. We measured time because we believe it is one of the most relevant metrics for database testing. For inserts, we used the sizes N = 100, 1000, 10000 and 100000.


    For deletes, updates and searches we used 50%, 25% and 20% of N respectively. Data was generated for these sizes of N for all types of operation. We ran each test three times for each type of operation and on all DBMSs for N = 100, 1000 and 10000 to get an average. Because of limited time, we were only able to run the tests for N = 100000 once.
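The averaging step described above can be sketched as follows: run the same timing test several times and take the mean. The function names are illustrative, not the thesis code:

```python
def average_time(time_test, n, runs=3):
    # Repeat the timing test and average the samples, as was done
    # for N = 100, 1000 and 10000 (N = 100000 was only run once)
    samples = [time_test(n) for _ in range(runs)]
    return sum(samples) / len(samples)
```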

    Listing 3.2 shows an example of how the insert test was carried out for the SQLAlchemy implementation. The MySQL implementation looks very similar, the only differences being how the generated data was loaded and how it was inserted. For the SQLAlchemy implementation, the data was read from the file and stored in memory as a dictionary of lists for each table, whereas the MySQL implementation loaded the file names that the data was stored in into a list. During insertion, the SQLAlchemy implementation indexed into the dictionary to get the correct list of data for each table, whereas the MySQL implementation iterated through the list of files for each table and inserted them via shell through Python, using the function shown in listing 3.3.

    Listing 3.2: Insert time test example for SQLAlchemy implementation.

    def insert_test(N):
        # Load the generated data
        data = get_json_data(N)

        # Measure the time the insert takes
        start_time = time.time()
        insert_records(data)
        end_time = time.time()

        # Calculate the actual time it took
        total_time = end_time - start_time
        return total_time

    Listing 3.3: Insert for the MySQL implementation.

    def insert_records(sql_files):
        for file in sql_files:
            command = "mysql --login-path=exjobb -D exjobb < " + file
            subprocess.run(command, shell=True)

    Code comparison

    For the second research question, we wanted to compare the code of the SQLAlchemy implementation and the MySQL implementation. We did this to get a feeling for how complex the code itself was and whether there was any significant difference between the two implementations. We created a script that, given a file, calculated the number of lines, the number of lines that were comments, and the number of lines that were expressions. The script ran on both the design and interface files of both implementations.
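A simplified sketch of such a counting script follows. The exact classification rules of the thesis script are not specified, so this version makes the assumption that only full-line '#' (Python) and '--' (SQL) comments count as comments and that blank lines are ignored:

```python
import os
import tempfile

def count_lines(path):
    # Return (lines, comments, expressions) for a source file.
    # A non-blank line counts as a comment if it starts with
    # '#' or '--', otherwise as an expression.
    lines = comments = expressions = 0
    with open(path) as f:
        for raw in f:
            stripped = raw.strip()
            if not stripped:
                continue
            lines += 1
            if stripped.startswith('#') or stripped.startswith('--'):
                comments += 1
            else:
                expressions += 1
    return lines, comments, expressions

# Small demonstration on a temporary file
fd, demo_path = tempfile.mkstemp(suffix='.py')
with os.fdopen(fd, 'w') as f:
    f.write('# a comment\nx = 1\n\ny = 2\n')
result = count_lines(demo_path)
os.remove(demo_path)
```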


  • 4 Results

    This chapter presents the results from the testing. Section 4.1 presents the results of the time tests and section 4.2 presents the results of the code comparison.

    4.1 Time results

    Tables 4.1, 4.2, 4.3, and 4.4 show the time it took to insert, delete, search and update for all test runs and sizes. The column Pure MySQL is the MySQL implementation; the remaining columns are the SQLAlchemy implementation.

    Figures 4.1, 4.2, 4.3, and 4.4 compare the average times for insert, delete, search and update. Note that the figures are log-scaled on the Y-axis.

    Figure 4.1 compares the average insert time for all sizes. Overall, we see similar trends for the SQLAlchemy implementation's DBMSs at all sizes: PostgreSQL has the lowest insert time, SQLite the highest, and MySQL stays somewhere in between the other two. The MySQL implementation is an entirely different story. It starts off faster than all DBMSs of the SQLAlchemy implementation, but as the insert size increases, so does the insert time. At size 100000, the MySQL implementation performs the worst of all DBMSs.

    Table 4.1: Result data from the insert time tests

    Size    Test run   SQLite (sec)  PostgreSQL (sec)  MySQL (sec)  Pure MySQL (sec)
    100     1          10.447        4.864             7.498        4.193
            2          10.757        4.825             7.562        4.188
            3          10.960        5.117             7.597        4.272
    1000    1          112.882       49.286            76.189       46.590
            2          110.997       48.299            76.224       46.737
            3          109.198       47.953            77.213       45.819
    10000   1          1171.439      483.592           777.567      794.455
            2          1208.083      479.709           814.619      840.806
            3          1197.372      480.752           815.737      847.164
    100000  1          12649.407     5056.615          8450.248     38292.714


    Table 4.2: Result data from the delete time tests

    Size    Test run   SQLite (sec)  PostgreSQL (sec)  MySQL (sec)  Pure MySQL (sec)
    50      1          5.862         2.854             4.656        2.231
            2          5.919         2.944             4.796        2.315
            3          6.112         3.042             4.836        2.216
    500     1          61.416        29.757            47.957       28.404
            2          58.998        31.738            47.919       27.892
            3          61.440        30.760            47.603       28.240
    5000    1          671.703       343.631           503.189      702.767
            2          682.714       341.233           501.535      700.427
            3          683.221       342.139           501.880      702.428
    50000   1          8255.309      6869.434          7242.145     48889.029

    Table 4.3: Result data from the search time tests

    Size    Test run   SQLite (sec)  PostgreSQL (sec)  MySQL (sec)  Pure MySQL (sec)
    25      1          0.247         0.428             0.441        0.142
            2          0.240         0.429             0.453        0.144
            3          0.251         0.459             0.487        0.143
    250     1          6.058         8.724             10.889       0.952
            2          6.023         8.537             10.627       0.951
            3          6.002         8.525             10.556       0.923
    2500    1          459.276       462.970           792.956      67.528
            2          457.630       462.830           791.180      67.047
            3          457.455       463.485           791.335      67.656
    25000   1          67050.430     44951.685         63668.016    6409.435

    Table 4.4: Result data from the update time tests

    Size    Test run   SQLite (sec)  PostgreSQL (sec)  MySQL (sec)  Pure MySQL (sec)
    20      1          3.071         1.543             2.534        1.173
            2          2.942         1.614             2.585        1.203
            3          3.190         1.731             2.485        1.189
    200     1          31.857        16.715            25.869       15.282
            2          30.049        16.453            25.506       14.977
            3          31.392        16.276            25.525       15.012
    2000    1          322.306       183.531           271.612      440.618
            2          313.813       180.310           275.019      441.491
            3          309.071       181.608           271.101      439.582
    20000   1          4242.794      2967.320          3898.965     34136.219


    Figure 4.1: Average insert time (seconds, log scale) versus size (100 to 100000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    Figure 4.2 shows the comparison of the average delete times for all sizes. Regarding the SQLAlchemy implementation, PostgreSQL has the lowest average time at all sizes while SQLite has the highest. However, concerning the MySQL implementation, that is not always the case. For sizes 50 and 500, Pure MySQL has the lowest averages; for the remaining sizes, PostgreSQL has the lowest average, while Pure MySQL has the highest for 5000 and 50000.

    Figure 4.2: Average delete time (seconds, log scale) versus size (50 to 50000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    Figure 4.3 shows the average time the search test took. The trends are similar for PostgreSQL and MySQL at all sizes, with PostgreSQL being faster than MySQL. SQLite starts off faster than the other SQLAlchemy implementations, but performs equally well or worse at larger sizes. The MySQL implementation outperforms the SQLAlchemy implementation at all sizes.

    Figure 4.4 compares the averages for updates. The figure shows the trend of PostgreSQL having the lowest average while SQLite has the highest within the


    Figure 4.3: Average search time (seconds, log scale) versus size (20 to 20000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    SQLAlchemy implementation. The MySQL implementation has the lowest average for the first half of the sizes and the highest for the latter half.

    Figure 4.4: Average update time (seconds, log scale) versus size (25 to 25000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    Figures 4.5, 4.6, 4.7, and 4.8 compare how many records per second insert, delete, search and update handle. In figures 4.5, 4.6 and 4.8, the trends are more or less identical for all DBMSs and sizes: PostgreSQL dominates the SQLAlchemy implementations, MySQL comes in second place, and SQLite is at the bottom. The MySQL implementation outperforms the SQLAlchemy implementation at smaller sizes but is outperformed at bigger sizes. Figure 4.7 shows the number of records searched per second. We can see that SQLite performs the best of the SQLAlchemy implementations and that the MySQL implementation dominates the SQLAlchemy implementation by a large margin.
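The per-second figures follow directly from the measured times: records per second is the size divided by the total time. For instance, using the first SQLite insert run at size 100 from table 4.1:

```python
def records_per_second(size, total_time):
    # Throughput measure used in figures 4.5 to 4.8
    return size / total_time

# First SQLite insert run at size 100 took 10.447 seconds (table 4.1)
throughput = records_per_second(100, 10.447)
```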


    Figure 4.5: Inserts per second versus size (100 to 100000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    Figure 4.6: Deletes per second versus size (50 to 50000) for SQLite, PostgreSQL, MySQL and Pure MySQL.


    Figure 4.7: Searches per second versus size (20 to 20000) for SQLite, PostgreSQL, MySQL and Pure MySQL.

    Figure 4.8: Updates per second versus size (25 to 25000) for SQLite, PostgreSQL, MySQL and Pure MySQL.


    Table 4.5: Result data from the code comparison test.

    File                           Lines  Comments  Expressions
    SQLAlchemy database design     101    0         101
    MySQL database design          73     0         73
    SQLAlchemy database interface  675    132       543
    MySQL database interface       531    118       413

    4.2 Code comparison results

    Table 4.5 shows the results of the code comparison. We can see that the MySQL implementation has fewer lines of code than the SQLAlchemy implementation in both the design file and the interface file.


  • 5 Discussion

    This chapter discusses the results and the method in sections 5.1 and 5.2 respectively. The work in a wider context is also discussed in section 5.3.

    5.1 Results

    Research question 1

    A small surprise during testing was the results for PostgreSQL. We had assumed that PostgreSQL and MySQL would be almost equal in time performance. However, the results show that PostgreSQL outperformed MySQL by large margins in every category tested. This could be due to differences in the translation from Python to the DBMS done by SQLAlchemy. It could also be related to the driver used when sending the queries to the DBMS. Therefore, this result does not mean that PostgreSQL is better than MySQL as a DBMS, but that SQLAlchemy is better at using PostgreSQL than MySQL.

    Besides SQLite, PostgreSQL and MySQL, there is another DBMS, Firebird, that we wanted to test as well. Because of how complicated it was to set up Firebird, and the fact that we did not get it to work, we chose not to include it in the list of DBMSs we tested. That we did not test Firebird does not change any of the results we got for the other DBMSs, but it would have been interesting to see how Firebird handled the tests.

    Research question 2

    The most challenging part of implementing a design in two different languages is making the implementations as similar as possible.

    One thing to note is that we believe that SQLAlchemy optimises the queries before executing them, whereas we have not done any optimisation for the MySQL implementation. Thus, the time it took to insert, delete and update records with the MySQL implementation grew disproportionately as the size increased. Looking at the results for the SQLAlchemy implementation, the time for the same inserts, deletes and updates only grows as you would expect, since more queries mean a longer time. The SQLAlchemy implementation does not scale as poorly as the MySQL implementation does.

    23

  • 5.2. Method

    Even though the MySQL implementation scaled poorly, it definitively performed the best when searching for records. This could be due to the way we searched for records in the SQLAlchemy implementation, or simply because a pure MySQL implementation is faster than SQLAlchemy at searching.

    The result from comparing the number of lines, comments, and expressions shows that Python code takes up more space than SQL. This is not that surprising, as SQL is a language purely focused on database queries, unlike Python, which has a wide array of uses. For example, in SQL you can do a lot of expressions and statements in a single query, while in Python it may be more practical to split some expressions and statements into functions and then use the result to do the actual query. SQL also handles errors for its queries automatically, unlike Python, in which you have to handle any errors that pop up manually, as well as possibly performing a database rollback in some situations. These are some of the reasons that lead to Python taking up more lines and expressions.
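The manual error handling mentioned above typically takes this shape in Python with SQLAlchemy. This is a generic sketch, not the thesis implementation; the Room table is illustrative:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Room(Base):
    __tablename__ = 'room'
    id = Column(Integer, primary_key=True)
    room_name = Column(String(12), unique=True, nullable=False)

engine = create_engine('sqlite://')  # in-memory database for the sketch
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def safe_add(session, obj):
    # Python must handle query errors itself, including rolling back
    try:
        session.add(obj)
        session.commit()
        return True
    except Exception:
        session.rollback()  # undo the failed transaction
        return False
```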

    Another thing to discuss regarding the difference between writing in Python and SQL is ease of use. When combining a DBMS with another program, the programmer is bound to switch between the SQL language and the other programming language. However, with Python and the possibility of using SQLAlchemy, it is no longer necessary to program in two languages at the same time. For example, when using SQLAlchemy it is possible to use Python syntax when creating query functions. This means that the programmer is not only on home ground but can also combine other Python functionality when issuing the queries. This arguably increases the effectiveness of the programmer compared to needing reasonable knowledge of both SQL and Python to accomplish the same task. All that is needed for a programmer to use SQLAlchemy is basic knowledge about databases and queries; SQLAlchemy handles the rest.

    5.2 Method

    Hardware

    One interesting factor is the hardware used for testing. We only had access to an old laptop, meaning that the results could potentially be better on a better computer. If someone were to replicate our work on other hardware, the results could vary depending on what hardware is used. On the other hand, the results should always be consistent on a particular hardware set-up, whatever that hardware is.

    Database management system

    As mentioned earlier, we chose SQLite, PostgreSQL and MySQL as our DBMSs because they were available and compatible with SQLAlchemy. With more time and money, we would have liked to test all the other DBMSs mentioned in section 2.6 to get an even better understanding of which DBMS performs the best using SQLAlchemy.

    Time testing

    Limited time was the biggest constraint when investigating the first research question, both for getting the average time the tests took and for testing an even larger number of records to be inserted, deleted, updated and searched.

    We wanted to get the average time it took to insert, delete, update and search records, but were only able to collect three samples. To estimate the true average, we would have liked at least 30 samples.

    We were only able to test the DBMSs for sizes of 100, 1000, 10000, and 100000. Initially, we also wanted to test a size of 1000000. After the test for 100000 was done, we calculated that


    in the best-case scenario, testing size 1000000 would have taken about one month for all DBMSs, which we did not have time for.

    5.3 The work in a wider context

    In our opinion, comparing databases against each other does not entail any ethical discussion. However, data collection is strongly connected to our work in a wider context, since most data is stored in databases. In the modern digital era, companies collect data about the people who visit their websites or use their products. It is important that people can control what information is gathered and how it is used. An example is when Facebook let third-party app developers collect data they did not need. This led to the Cambridge Analytica scandal [11, 12], where an app developer provided personal data about Facebook users to Cambridge Analytica, which used it for marketing purposes. In May 2018, the EU regulation General Data Protection Regulation (GDPR) came into force, whose purpose is to help people gain better control over their information online. We believe this is a step in the right direction; however, it is still not enough, as GDPR is only enforced in the EU rather than being a worldwide regulation.


  • 6 Conclusion

    This chapter concludes the thesis by answering the research questions. Section 6.1 concludes which DBMS performed the best regarding time, and section 6.2 concludes the code comparison.

    6.1 Time test

    For the first research question, the results show that PostgreSQL was the fastest in the majority of all tests. PostgreSQL only fell short in the search test, where SQLite was faster at smaller sizes. Therefore, we conclude that PostgreSQL is the best option for our database design and SQLAlchemy implementation.

    For the second research question, the MySQL implementation performed better than the SQLAlchemy implementation at smaller sizes but worse at larger sizes; i.e., the MySQL implementation scaled worse than the SQLAlchemy implementation. We believe this is because SQLAlchemy optimises its queries, which we did not do for the MySQL implementation. Therefore, we conclude that SQLAlchemy is the better option for the average SQL user.

    6.2 Code comparison

    As the results show, the MySQL implementation can be written with fewer lines of code. With that said, the difference between the MySQL implementation and the SQLAlchemy implementation is not big enough to exclude SQLAlchemy as an option. We therefore conclude that SQLAlchemy is the better option, since it performed better in the time tests.

    6.3 Future work

    For future work on this project, it may be interesting to add to the comparison more of the relational databases that SQLAlchemy supports but that we could not afford. This would give a better perspective on the best choice of DBMS when using SQLAlchemy. Studying non-relational (NoSQL) databases is another option, as they are considered more modern than relational databases.


    Another thing to consider would be to increase the number of tables in the database used in this research. The current database is very small compared to those of big enterprises. It would therefore be good to also run these tests on an enterprise-sized database, to get results closer to reality.

    It would also be worthwhile to find other tools similar to SQLAlchemy. That way, it would be possible to perform the same research on those tools and thus create the opportunity to compare SQLAlchemy with its possible competitors.

    Lastly, it would be of great interest to investigate further the reasons for the results regarding the Pure MySQL implementation and possibly correct any issues found, either to give a concrete reason for the nonlinear results found in chapter 4 or to redo the tests with a better version of the MySQL implementation.


  • Bibliography

    [1] Gene Bellinger, Durval Castro, and Anthony Mills. “Data, information, knowledge, and wisdom”. In: (2004).

    [2] Keith F Punch. Introduction to social research: Quantitative and qualitative approaches. Sage, 2013.

    [3] Ramez Elmasri and Sham Navathe. Database systems: models, languages, design and application programming. Boston, Mass.; Singapore: Pearson, 2011. ISBN: 9780132144988. URL: https://login.e.bibl.liu.se/login?url=https://search.ebscohost.com/login.aspx?direct=true&db=cat00115a&AN=lkp.575458&lang=sv&site=eds-live&scope=site.

    [4] Rick Cattell. “Scalable SQL and NoSQL data stores”. In: ACM SIGMOD Record 39.4 (2011), pp. 12–27.

    [5] Neal Leavitt. “Will NoSQL databases live up to their promise?” In: Computer 43.2 (2010).

    [6] Edgar F Codd. “A relational model of data for large shared data banks”. In: Communications of the ACM 13.6 (1970), pp. 377–387.

    [7] Toby J Teorey, Dongqing Yang, and James P Fry. “A logical design methodology for relational databases using the extended entity-relationship model”. In: ACM Computing Surveys (CSUR) 18.2 (1986), pp. 197–222.

    [8] William Kent. “A simple guide to five normal forms in relational database theory”. In: Communications of the ACM 26.2 (1983), pp. 120–125.

    [9] Moussa Demba. “Algorithm for relational database normalization up to 3NF”. In: International Journal of Database Management Systems 5.3 (2013), p. 39.

    [10] Elizabeth J O’Neil. “Object/relational mapping 2008: Hibernate and the entity data model (EDM)”. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1351–1356.

    [11] Robinson Meyer. The Cambridge Analytica Scandal, in Three Paragraphs. 2018. URL: https://www.theatlantic.com/technology/archive/2018/03/the-cambridge-analytica-scandal-in-three-paragraphs/556046/ (visited on 02/09/2019).


  • Bibliography

    [12] Alvin Chang. The Facebook and Cambridge Analytica scandal, explained with a simple diagram. 2018. URL: https://www.vox.com/policy-and-politics/2018/3/23/17151916/facebook-cambridge-analytica-trump-diagram (visited on 02/09/2019).
