
DATA STRUCTURE AND ALGORITHM USING PYTHON

Peter Lo

File Manipulation and Regular Expression


Open and Read Text File

File Manipulation

Data Structure and Algorithm using Python @ Peter Lo 2019


File

Python handles two separate types of files:

◼ Binary files

◼ Text files

Knowing the difference between the two is important because they are handled differently.


Binary File

Most files that you use on your computer are binary files. For example, a Microsoft Word .doc file is a binary file, even though it has text in it.

Other examples of binary files include:

◼ Image files such as .jpg, .png, .bmp, and .gif

◼ Database files such as .mdb, .frm, and .sqlite

◼ Documents such as .doc, .xls, and .pdf

These files need special handling and usually require a specific type of software to open them. For example, you need Excel to open an .xls file.


Text File

A text file stores human-readable characters and can be opened by a standard text editor without any special handling.

Every text file adheres to a set of rules:

◼ Text files have to be readable as is. They can contain a lot of markup, as in HTML or other markup languages, but you will still be able to tell what the file says.

◼ Data in a text file is organized by lines. In most cases, each line is a distinct element, whether it is a line of instruction or a command.

◼ Each line ends with an unseen character that lets the text editor know that a new line should start. In Python, it is denoted by “\n”.


Comma Separated Value (CSV) Text File

A CSV (comma-separated values) file is a plain-text file commonly used by spreadsheet programs such as Microsoft Excel.

Each line in the CSV file represents one database row, and each row consists of one or more fields separated by commas.


Fixed-width Text File

Data in a fixed-width text file is arranged in rows and columns, with one entry per row.

Each column has a fixed width, specified in characters, which determines the maximum amount of data it can contain.

No delimiters are used to separate the fields in the file.
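Because there are no delimiters, a fixed-width line is split by slicing at known column offsets. The following is a minimal sketch; the column layout (name, city, age) and the sample record are hypothetical and chosen only for illustration.

```python
# Hypothetical layout: name (10 chars), city (10 chars), age (3 chars).
FIELD_WIDTHS = [("name", 10), ("city", 10), ("age", 3)]


def parse_fixed_width(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    record = {}
    start = 0
    for field, width in FIELD_WIDTHS:
        record[field] = line[start:start + width].strip()
        start += width
    return record


# Build a sample line that follows the layout above.
sample = f"{'Alice':<10}{'Hong Kong':<10}{'30':>3}"
record = parse_fixed_width(sample)
print(record)
```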


Tab-Separated Values (TSV) Text File

A Tab-Separated Values (TSV) file (also called a tab-delimited file) is a simple text format for storing data in a tabular structure, e.g., a database table or spreadsheet data, and a way of exchanging information between databases.

Each record in the table is one line of the text file, with the fields separated by tab characters.


Opening a File in Python

To open a file for reading or writing in Python, use the built-in open() function.

The open() function is typically called with two parameters: the filename and the access mode.

On the returned file object, the mode attribute tells you which mode the file was opened in, and the name attribute tells you the name of the file that the file object has opened.


File Access Modes

Mode  Function
r     Open a file for reading only. Reading starts from the beginning of the file. This is the default mode.
rb    Open a file for reading only in binary format. Reading starts from the beginning of the file.
r+    Open a file for reading and writing. The file pointer is placed at the beginning of the file.
rb+   Open a file for reading and writing in binary format. The file pointer is placed at the beginning of the file.
w     Open a file for writing only. The file pointer is placed at the beginning of the file. Overwrites the file if it exists and creates a new one if it does not.
wb    Same as w but opens in binary mode.
w+    Same as w but also allows reading from the file.
wb+   Same as wb but also allows reading from the file.
a     Open a file for appending. Writing starts at the end of the file. Creates a new file if the file does not exist.
ab    Same as a but in binary format.
a+    Same as a but also open for reading.
ab+   Same as ab but also open for reading.


File Access Modes Summary

Function                              r    r+   w    w+   a    a+
Read                                  ✓    ✓    -    ✓    -    ✓
Write                                 -    ✓    ✓    ✓    ✓    ✓
Create new file if it does not exist  -    -    ✓    ✓    ✓    ✓
Overwrite existing file               -    -    ✓    ✓    -    -
Pointer placed at beginning of file   ✓    ✓    ✓    ✓    -    -
Pointer placed at end of file         -    -    -    -    ✓    ✓


The File Object Attributes

Once a file is opened and you have a file object, you can get various information related to that file.

Here is a list of attributes of the file object:

Attribute    Description
file.closed  True if the file is closed, False otherwise.
file.mode    The access mode with which the file was opened.
file.name    The name of the file.
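As a minimal sketch, these attributes can be inspected as follows. The file name demo.txt and the temporary directory are assumptions made so the example does not depend on any existing file.

```python
import os
import tempfile

# Scratch file in a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w")
info_while_open = (f.name, f.mode, f.closed)
f.close()
closed_after = f.closed

print(info_while_open)   # name, mode, and closed flag while the file is open
print(closed_after)      # closed flag after close()
```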


Closing a File in Python

When we are done with operations on the file, we need to close it properly.

Closing a file frees the resources that were tied to the file and is done using the close() method.

Python has a garbage collector that cleans up unreferenced objects, but we must not rely on it to close the file.


Example
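The original example is not preserved in this transcript. The following is a minimal sketch of opening, writing, and closing a file, assuming a hypothetical scratch file; it also shows that a closed file object can no longer be written to.

```python
import os
import tempfile

# Scratch file in a temporary directory (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w")
f.write("first line\n")
f.close()                      # flush buffered data and release the handle

try:
    f.write("second line\n")   # the file object is no longer usable
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(f.closed)
print(error_message)
```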


Exception Handling

If an exception occurs while we are performing some operation on the file, the code exits without closing the file.


The try … except Block

Error handling in Python is done through exceptions, which are raised inside try blocks and handled in except blocks.

If an error is encountered, execution of the try block stops and control is transferred to the except block.

In addition to using an except block after the try block, you can also use a finally block.

The code in the finally block is executed regardless of whether an exception occurs.


The try … except Block (cont.)

By using a try ... finally block, we guarantee that the file is properly closed even if an exception is raised and interrupts the program flow.
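A minimal sketch of this pattern, assuming a hypothetical scratch file whose second line deliberately cannot be converted to a number:

```python
import os
import tempfile

# Prepare a scratch file (hypothetical content) with one bad line.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
f = open(path, "w")
f.write("42\nnot a number\n")
f.close()

total = 0
problem = None
f = open(path, "r")
try:
    for line in f:
        total += int(line)      # raises ValueError on the second line
except ValueError as exc:
    problem = f"bad line: {exc}"
finally:
    f.close()                   # runs whether or not an exception occurred

print(total, problem, f.closed)
```

The with statement gives the same guarantee more compactly, but the explicit try ... finally form shows what happens underneath.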


Writing to File

The write() method writes a string to an open file. In text mode it accepts str; to write binary data, open the file in a binary mode such as "wb" and pass bytes.

The write() method does not add a newline character ('\n') to the end of the string.

Depending on the file access mode, the target file is either overwritten or appended to.


Example: Insert Data to a New File

With the “w” access mode, the data is written to a new file; if the file already exists, its contents are overwritten.

Content in file “Texting.txt”
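The original listing is not preserved. The following is a minimal sketch, assuming hypothetical text and a temporary directory for Texting.txt; the final print shows the resulting file content.

```python
import os
import tempfile

# Keep Texting.txt in a temporary directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "Texting.txt")

f = open(path, "w")              # "w" creates the file or overwrites it
f.write("Hello, world!\n")       # write() adds no newline, so add it yourself
f.write("This is a new text file.\n")
f.close()

f = open(path, "r")
content = f.read()
f.close()
print(content)
```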


Example: Append Data to Existing File

With the “a” access mode, the data is appended to the end of the existing file.

Content in file “Texting.txt”
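The original listing is not preserved. As a minimal sketch with hypothetical text, the file is first created with "w" and then extended with "a"; the final print shows that the earlier content is kept.

```python
import os
import tempfile

# Keep Texting.txt in a temporary directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "Texting.txt")

f = open(path, "w")
f.write("First line\n")
f.close()

f = open(path, "a")              # "a" keeps existing content and writes at the end
f.write("Appended line\n")
f.close()

f = open(path, "r")
content = f.read()
f.close()
print(content)
```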


Reading from a File

To read a file in Python, we must open the file in a reading mode. There are several ways to read data from a file.

Method       Description
read()       Returns the specified number of characters from the file. If the size is omitted, it reads the entire contents of the file.
readline()   Returns the next line of the file.
readlines()  Returns all the lines of the file as a list of strings.


Read File using readline() method

If you want to read a file line by line, as opposed to pulling in the content of the entire file at once, use the readline() method.
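A minimal sketch of reading line by line, assuming a hypothetical three-line scratch file. readline() returns an empty string at the end of the file, which ends the loop.

```python
import os
import tempfile

# Prepare a scratch file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
f = open(path, "w")
f.write("alpha\nbeta\ngamma\n")
f.close()

lines_seen = []
f = open(path, "r")
try:
    line = f.readline()
    while line:                              # '' signals end of file
        lines_seen.append(line.rstrip("\n"))
        line = f.readline()
finally:
    f.close()

print(lines_seen)
```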


Read File using readlines() method

The readlines() method reads the whole file and returns its lines as a list of strings, one element per line, each keeping its trailing newline character.
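A minimal sketch, assuming a hypothetical scratch file. Each list element keeps its trailing newline, so it is often stripped afterwards.

```python
import os
import tempfile

# Prepare a scratch file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
f = open(path, "w")
f.write("alpha\nbeta\ngamma\n")
f.close()

f = open(path, "r")
raw_lines = f.readlines()                    # list of strings, newlines included
f.close()

clean_lines = [line.rstrip("\n") for line in raw_lines]
print(raw_lines)
print(clean_lines)
```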


Read File using read() method

If you need to extract a string that contains all characters in the file, you can use the read() method:
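A minimal sketch, assuming a hypothetical scratch file. It reads the first five characters with read(5) and then the remainder with read().

```python
import os
import tempfile

# Prepare a scratch file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
f = open(path, "w")
f.write("alpha\nbeta\n")
f.close()

f = open(path, "r")
first_five = f.read(5)     # at most five characters
rest = f.read()            # everything from the current position to the end
f.close()

print(repr(first_five))
print(repr(rest))
```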


File Pointer Location

We can change the current file cursor (position) using the seek() method.

Similarly, the tell() method returns the current position (as a number of bytes from the start of the file when the file is opened in binary mode).
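A minimal sketch of moving the file pointer, assuming a hypothetical ten-byte scratch file. Binary mode is used here so that positions are plain byte offsets.

```python
import os
import tempfile

# Prepare a ten-byte scratch file (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "digits.bin")
f = open(path, "wb")
f.write(b"0123456789")
f.close()

f = open(path, "rb")
start = f.tell()           # position right after opening
chunk = f.read(4)          # read four bytes
after_read = f.tell()      # the pointer advanced by four
f.seek(7)                  # jump to byte offset 7
tail = f.read()            # read from offset 7 to the end
f.close()

print(start, chunk, after_read, tail)
```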


Parsing CSV Files

The csv library provides functionality to both read from and write to CSV files.

Designed to work out of the box with Excel-generated CSV files, it is easily adapted to work with a variety of CSV formats.

The csv library contains objects and other code to read, write, and process data from and to CSV files.


Reading CSV Files with csv

Reading from a CSV file is done using the reader object. The CSV file is opened as a text file with Python’s built-in open() function, which returns a file object. This is then passed to the reader, which does the heavy lifting.
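A minimal sketch of csv.reader, assuming a hypothetical employee file created on the fly. Passing newline="" to open() is what the csv documentation recommends for CSV files.

```python
import csv
import os
import tempfile

# Prepare a small CSV file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "employees.csv")
with open(path, "w", newline="") as f:
    f.write("name,department\nJohn,Accounting\nErica,IT\n")

with open(path, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)      # first row holds the column names
    rows = list(reader)        # each remaining row is a list of strings

print(header)
print(rows)
```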


Reading CSV Files into Dictionary with csv

Rather than deal with a list of individual string elements, you can read CSV data directly into dictionaries using DictReader.
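A minimal sketch using csv.DictReader, assuming the same kind of hypothetical employee file. The first row supplies the dictionary keys.

```python
import csv
import os
import tempfile

# Prepare a small CSV file (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "employees.csv")
with open(path, "w", newline="") as f:
    f.write("name,department\nJohn,Accounting\nErica,IT\n")

with open(path, newline="") as f:
    records = [dict(row) for row in csv.DictReader(f)]

for record in records:
    print(record["name"], "works in", record["department"])
```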


Writing CSV Files with csv

You can also write to a CSV file using a writer object and its writerow() method:
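A minimal sketch with hypothetical data. Note that the method in the csv module is spelled writerow(), and that the writer ends each row with "\r\n" by default.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "employees.csv")

with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "department"])     # header row
    writer.writerow(["John", "Accounting"])     # one data row

with open(path, newline="") as f:
    content = f.read()

print(repr(content))
```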


Quoting Optional Parameter

The quotechar optional parameter tells the writer which character to use to quote fields when writing.

Whether quoting is used or not, however, is determined by the quoting optional parameter.

Parameter             Meaning
csv.QUOTE_MINIMAL     .writerow() quotes fields only if they contain the delimiter or the quotechar. This is the default.
csv.QUOTE_ALL         .writerow() quotes all fields.
csv.QUOTE_NONNUMERIC  .writerow() quotes all non-numeric fields. When reading with this setting, unquoted fields are converted to the float data type.
csv.QUOTE_NONE        .writerow() escapes delimiters instead of quoting fields. In this case, you must also provide a value for the escapechar optional parameter.


Example
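The original example is not preserved in this transcript. The following is a minimal sketch that writes one hypothetical row under each quoting mode; io.StringIO and the helper render() are used here only so the output can be inspected without a file.

```python
import csv
import io

# Hypothetical row: plain text, text containing the delimiter, and a number.
row = ["Erica Meyers", "IT, Support", 42]


def render(**writer_options):
    """Write the sample row with the given csv.writer options and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_options).writerow(row)
    return buffer.getvalue()


minimal = render(quoting=csv.QUOTE_MINIMAL)
quote_all = render(quoting=csv.QUOTE_ALL)
nonnumeric = render(quoting=csv.QUOTE_NONNUMERIC)
escaped = render(quoting=csv.QUOTE_NONE, escapechar="\\")

for label, text in [("MINIMAL", minimal), ("ALL", quote_all),
                    ("NONNUMERIC", nonnumeric), ("NONE", escaped)]:
    print(f"{label:<11}{text.rstrip()}")
```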


Writing CSV File from Dictionary with csv

Unlike with DictReader, the fieldnames parameter is required when writing dictionaries with DictWriter. The keys in fieldnames determine the column order, and the writeheader() method writes them out as the first row of column names.
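A minimal sketch using csv.DictWriter with hypothetical data. An in-memory io.StringIO buffer stands in for a file opened with newline="".

```python
import csv
import io

buffer = io.StringIO()                       # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=["name", "department"])

writer.writeheader()                         # first row: the column names
writer.writerow({"name": "John", "department": "Accounting"})
writer.writerow({"name": "Erica", "department": "IT"})

output = buffer.getvalue()
print(output)
```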


A Simplified Guide

Regular Expression


Regular Expression Module

A regular expression is a special text string used for describing a search pattern.

It is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.

It is widely used in natural language processing, in web applications that need to validate strings, and in most data science projects that involve text mining.

In Python, it is implemented in the standard module re.

More information can be found in https://docs.python.org/3/library/re.html


What is a regex pattern?

A regex pattern is a special language used to represent generic text, numbers, or symbols so that it can be used to extract text that conforms to that pattern.

Consider the example expression “\s+”.

Here “\s” matches any whitespace character.

Adding '+' at the end makes the pattern match one or more whitespace characters.

Because tabs count as whitespace, this pattern matches tab characters as well.


Split String Separated by regex

If you intend to use a particular pattern multiple times, you are better off compiling the regular expression with re.compile() rather than calling re.split() over and over again.

The '\s' matches any whitespace character. Adding '+' at the end makes the pattern match one or more whitespace characters, including tab '\t' characters.

The file in this example contains three columns, but the separators between them differ.
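The original listing is not preserved. The following is a minimal sketch, assuming hypothetical three-column records whose separators mix spaces and tabs; the pattern is compiled once and reused for every line.

```python
import re

# Hypothetical records: three columns separated by varying whitespace.
lines = [
    "101 COM    Computers",
    "205\tMAT\tMathematics",
    "189  ENG \t English",
]

whitespace = re.compile(r"\s+")              # compile once, reuse many times
rows = [whitespace.split(line) for line in lines]

for row in rows:
    print(row)
```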


Greedy vs Non-greedy Matching

Greedy matching returns the longest possible match.

Non-greedy matching returns the shortest possible match.

Consider the string “123 ABC 456 xyz”.

For the greedy expression \d+

◼ Result: ['123', '456']

◼ Each match takes as many digits as possible

For the non-greedy expression \d+?

◼ Result: ['1', '2', '3', '4', '5', '6']

◼ Each match takes as few digits as possible
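The two results above can be reproduced with re.findall(), shown here as a short sketch:

```python
import re

text = "123 ABC 456 xyz"

greedy = re.findall(r"\d+", text)        # each match grabs as many digits as it can
non_greedy = re.findall(r"\d+?", text)   # each match stops after a single digit

print(greedy)
print(non_greedy)
```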


Wildcards and Anchors


. (a dot) matches any character except \n

".oo.y" matches "Doocy", "goofy", "LooPy", ...

use \. to literally match a dot . character


Wildcards and Anchors

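^ matches the beginning of a line; $ matches the end (in Python, the beginning and end of the whole string unless re.MULTILINE is set)

"^fi$" matches lines that consist entirely of fi

\b matches a word boundary: placed before a pattern it requires the pattern to begin a word, and placed after it requires the pattern to end a word (the \< and \> anchors used by tools such as grep are not supported by Python's re module)

r"\bfor\b" matches lines that contain the word "for"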

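A minimal sketch of anchors in Python, using hypothetical sample lines. Python's re module spells word boundaries as \b; the \< and \> forms belong to tools such as grep, and in Python they would match literal < and > characters.

```python
import re

# Hypothetical sample lines.
lines = ["fi", "fine", "wait for it", "before noon"]

# ^ and $ pin the pattern to the start and end of each string.
only_fi = [s for s in lines if re.search(r"^fi$", s)]

# \b requires a word boundary on each side of "for".
has_word_for = [s for s in lines if re.search(r"\bfor\b", s)]

print(only_fi)
print(has_word_for)
```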

Boolean

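| means OR

"abc|def|g" matches lines with "abc", "def", or "g"

| has the lowest precedence: "^(Subject|Date):" matches lines that start with "Subject:" or "Date:", while "^Subject|Date:" matches lines that start with "Subject" or contain "Date:" anywhere

There's no AND symbol.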
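The precedence difference can be checked with a short sketch. The header lines below are hypothetical, and the patterns are written with a trailing colon.

```python
import re

# Hypothetical mail-header style lines.
headers = ["Subject: hi", "Date: today", "Re: Date: x"]

# Grouped: ^ applies to both alternatives.
grouped = [h for h in headers if re.search(r"^(Subject|Date):", h)]

# Ungrouped: reads as (^Subject) OR (Date:), so "Date:" may appear anywhere.
ungrouped = [h for h in headers if re.search(r"^Subject|Date:", h)]

print(grouped)
print(ungrouped)
```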


Grouping


() are for grouping

"(Homer|Marge)" matches lines containing "Homer" or

"Marge"


Finding Matched Pattern

There are three methods for finding a matched pattern:

◼ findall() returns the matched portions of the text as a list.

◼ search() returns a match object that contains the starting and ending positions of the first occurrence of the pattern.

◼ match() also returns a match object, but it requires the pattern to be present at the beginning of the text.


Finding Pattern using findall

With the pattern '\d+', the findall() method extracts all occurrences of one or more digits from the text and returns them in a list.

The special sequence '\d' matches any digit. Adding a '+' symbol requires at least one digit to be present for a match.
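A minimal sketch with a hypothetical sentence:

```python
import re

text = "Order 66 shipped 3 items on day 12"   # hypothetical sample text

numbers = re.findall(r"\d+", text)             # every run of one or more digits
print(numbers)
```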


Finding Pattern using search

The search() method returns a match object that holds the first matched value together with its position.
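A minimal sketch with a hypothetical sentence. The match object exposes the matched text through group() and its position through start() and end().

```python
import re

text = "Order 66 shipped"          # hypothetical sample text

m = re.search(r"\d+", text)        # first occurrence only
found = m.group()
span = (m.start(), m.end())

print(found, span)
```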


Finding Pattern using match

If the pattern is not present at the beginning of the text, the match() method finds nothing and returns None.
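A minimal sketch with two hypothetical strings, one that starts with digits and one that does not:

```python
import re

at_start = re.match(r"\d+", "66 orders shipped")   # digits at the beginning
not_at_start = re.match(r"\d+", "Order 66")        # digits appear later

print(at_start.group())
print(not_at_start)
```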


Substitute Text with Another


To replace texts, use the sub() method.
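A minimal sketch that collapses runs of whitespace into single spaces; the sample text is hypothetical.

```python
import re

text = "101   COM \t Computers"            # hypothetical sample text

normalized = re.sub(r"\s+", " ", text)     # replace each whitespace run with one space
print(normalized)
```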


Regular Expression Quick Guide

Symbol  Meaning
^       Matches the beginning of a line
$       Matches the end of the line
.       Matches any character except line terminators like \n
*       Repeats a character zero or more times
*?      Repeats a character zero or more times (non-greedy)
+       Repeats a character one or more times
+?      Repeats a character one or more times (non-greedy)
(       Indicates where string extraction is to start
)       Indicates where string extraction is to end


Regular Expression Quick Guide

Symbol    Meaning
\s        Matches a whitespace character
\S        Matches any non-whitespace character
\d        Matches a digit
\D        Matches a non-digit
\w        Matches an alphanumeric character: a-z, A-Z, 0-9, and underscore (_)
\W        Matches a non-alphanumeric character
\n        Matches a new line
[abcde]   Matches a single character in the listed set
[^xyz]    Matches a single character not in the listed set
[a-z0-9]  The set of characters can include a range