An introduction to setting up a Python development environment, the basic language structures, basic Git, and some popular libraries such as NumPy and Pandas.
Contributed by Ahmad Aghaebrahimian (ZHAW-ICLS)
Python was created by Guido van Rossum and was first released in 1991. It was named after Monty Python, a popular British comedy group. Python was designed to be an easy-to-read and easy-to-write language that emphasizes readability and code reusability.
In the late 1990s, Python became increasingly popular as a general-purpose programming language, due to its simplicity, versatility, and ability to run on almost any platform. This led to the development of a large number of third-party libraries, making it a popular choice for various applications, including scientific computing, data analysis, artificial intelligence, and web development. For scientific computing and data analysis, for instance, Python has a number of powerful libraries, including NumPy, which provides support for arrays and matrices, and Pandas, which provides fast and efficient data analysis and manipulation. In the field of Artificial Intelligence (AI), Python has a number of libraries too, including TensorFlow and PyTorch, which are used to build and train machine learning models. Additionally, the scikit-learn library provides a simple and efficient way to perform machine learning tasks, including regression, classification, and clustering. Last but not least, Python has a number of popular libraries, including Django, Flask, and FastAPI for web development.
In the early 2000s, Python gained even more popularity when it was included as one of the standard scripting languages for the popular Linux operating system. Today, Python is used by organizations such as NASA, Google, and Facebook, and it has become one of the most widely-used programming languages in the world.
Python has continued to evolve over the years, with the release of several major versions, each of which has brought new features and improvements to the language. The latest major version, Python 3, was released in 2008 and has since become the preferred version of the language, due to its improved performance and enhanced features. This overview uses Python 3 as the default version.
The first step in setting up a Python development environment is to have a Python interpreter installed on the computer. This process may vary slightly depending on the computer's operating system. A virtual environment and an Integrated Development Environment (IDE) are installed afterward. Please note that if you create a virtual environment as described in Step 2 below, the first step can safely be skipped, since the virtual environment contains its own Python interpreter.
Linux: Most Linux distributions come with Python pre-installed. This can be checked by opening a terminal (Ctrl+Alt+T) and running python --version (on some distributions the command is python3 --version).
macOS: Python comes pre-installed on macOS too. Similar to Linux, one can check this in a terminal.
You can safely disregard the version of the Python that comes pre-installed by default. In the next step, you will set up a virtual environment, allowing you to create multiple Python environments with different versions as needed.
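The same check can also be done from inside Python, which is handy later for confirming which interpreter (system or virtual environment) is actually running your code. This is a minimal sketch that uses only the standard library; the printed values depend on your machine.

```python
import platform
import sys

# sys.version_info describes the interpreter that is running this script
major, minor = sys.version_info.major, sys.version_info.minor
print(f"Running Python {major}.{minor} ({platform.python_implementation()})")

# sys.executable is the path of that interpreter; inside an activated
# environment it normally points into the environment's folder
print("Interpreter path:", sys.executable)

if major < 3:
    print("This overview assumes Python 3.")
```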
The next step in setting up a Python development environment is to create a virtual environment. Virtual environments are an essential tool for Python development. They help maintain a clean and organized development environment and ensure the stability and reproducibility of your projects. Furthermore, virtual environments isolate the development environment from the global Python interpreter, which is particularly helpful for beginners. There are several ways to create a virtual environment. Here, we use Anaconda for this purpose.
Download the Anaconda script compatible with your Linux distribution from here at the bottom of the page.
Run the script, accept the license, and accept all the default values except the last one, which asks: Do you wish the installer to initialize Anaconda3 by running conda init? Answer this question with yes.
When installed, open a new terminal and run conda create -n my_new_env python==3.10. Accept all the default options.
Activate your new environment with conda activate my_new_env and check the version with python --version.
my_new_env is isolated from the system altogether. You can safely remove it with conda remove -n my_new_env --all and create a brand new one.
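Alternatively, you can return to the base environment by deactivating the active environment with conda deactivate.
A complete Anaconda installation comes with Jupyter Notebook, the most popular notebook system for Python. To run a notebook locally, open a terminal and run jupyter notebook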
Read more about installing Anaconda on Linux here.
For more information about Conda commands check this
Download the Anaconda Mac graphical installer from here at the bottom of the page.
Run the installer, accept the license agreement and all default values.
When installed, open a new terminal and run conda create -n my_new_env python==3.10. Accept all the default options. (For these steps you may observe images similar to the ones in the Linux part above.)
Activate your new environment with conda activate my_new_env and check the version with python --version.
my_new_env is isolated from the system altogether. You can safely remove it with conda remove -n my_new_env --all and create a brand new one.
Alternatively, you can return to the base environment by deactivating the active environment with conda deactivate.
A complete Anaconda installation comes with Jupyter Notebook, the most popular notebook system for Python. In Python, a notebook is a web-based interactive computational interface that enables you to run and experiment with Python code interactively. To run a notebook locally, open a terminal and run jupyter notebook
Read more about installing Anaconda on Mac here.
For more information about Conda commands check this
Download the Anaconda Windows graphical installer from here, at the bottom of the page.
Run the installer, accept the license agreement and all default values. On the 'Advanced Installation Options' page, check all the fields.
When installed, you can work with environments in the command line. Open a command line prompt (CMD) from the start menu and run conda create -n my_new_env python==3.10. Accept all the default options. (For these steps you may observe images similar to the ones in the Linux part above.)
Activate your new environment with conda activate my_new_env and check the version with python --version.
my_new_env is isolated from the system altogether. You can safely remove it with conda remove -n my_new_env --all and create a brand new one.
Alternatively, you can return to the base environment by deactivating the active environment with conda deactivate.
A complete Anaconda installation comes with Jupyter Notebook, the most popular notebook system for Python. In Python, a notebook is a web-based interactive computational interface that enables you to run and experiment with Python code interactively. To run a notebook locally, open CMD and run jupyter notebook in it. You can also launch it from the 'Anaconda Navigator' in the start menu.
Read more about installing Anaconda on Windows here.
For more information about Conda commands check this
Once you install an Integrated Development Environment (IDE) in the next step, Anaconda will be integrated into the IDE, and each project can be configured with a new environment within the IDE.
The last step is to install a code editor or an Integrated Development Environment (IDE) such as Visual Studio Code, PyCharm, or IDLE (the built-in Python editor). In this tutorial, we install and use PyCharm as the default IDE.
The easiest way to install PyCharm on Linux (Ubuntu) is to use the Software Center. Other distributions have similar options. Search for PyCharm and install either Pycharm-community or Pycharm-EDU. This tutorial uses Pycharm-community.
Another alternative is to run sudo snap install pycharm-community --classic in the terminal.
When installed, click on New Project in the welcome window or in the File menu at the top of the window (if no welcome window has appeared).
PyCharm integrates with the Anaconda installation that is already in place, allowing the creation of a new environment with a specific Python version for each project. However, it's important to keep in mind that some libraries may not be compatible with certain Python versions. In that case, it's recommended to create an environment with a Python version that is compatible with all the required libraries. With Conda installed, removing an environment with an incompatible Python version and creating a new, compatible environment is a straightforward process, as described above.
The project automatically creates main.py. You can execute it by right-clicking on the file and selecting Run 'main.py', or from Run in the top menus.
If required, you can install and import new libraries. Click on Terminal at the bottom of the window to open a terminal. Note that the prompt includes the active environment, which was created when the project was initialized. Run conda install numpy or pip install numpy to install NumPy, a powerful library for working with vectors and matrices.
Check which libraries are already installed in your environment by running conda list or pip list in the terminal.
If the version of a library, let's say NumPy, is incompatible, you can safely remove it from your environment with pip uninstall numpy and install the correct version, let's say 1.24.0, with pip install numpy==1.24.0.
When delivering your code to someone else or running it on other machines, you can make sure that the code runs in the same environment with the same library versions by generating a requirements file that lists all libraries with their versions: pip list --format=freeze > requirements.txt (or conda list --export > requirements.txt for a Conda-format file). This creates a new file, requirements.txt, which should accompany the rest of your code. The recipient can replicate the environment by installing the exact libraries with pip install -r requirements.txt for the pip-format file, or with conda create -n my_new_env --file requirements.txt for the Conda-format file, which pip cannot read.
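To see what such a requirements file contains, the sketch below builds similar name==version lines from inside Python with the standard-library importlib.metadata module (available from Python 3.8). The helper name freeze_lines is made up for this illustration, and the result is not guaranteed to match the output of pip exactly.

```python
from importlib import metadata


def freeze_lines():
    """Hypothetical helper: return 'name==version' strings for installed packages."""
    lines = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:  # skip installations with missing metadata
            continue
        name, version = meta.get("Name"), meta.get("Version")
        if name and version:
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


requirements = freeze_lines()
print(f"{len(requirements)} packages found")
print("\n".join(requirements[:5]))  # show the first few entries only
```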
Download the PyCharm installer from here. Select the Community version, which is free and open source. Also, select the dmg package compatible with your processor type (Intel or Apple Silicon).
Install PyCharm by running the installer. Accept all the default values.
When finished, you can run PyCharm by clicking on the newly created icon 'Pycharm CE' in the Applications folder.
Click on New Project in the welcome window or in the File menu at the top of the window (if no welcome window has appeared). From here on, the procedure is the same as from step 3 onward in the Linux part above. Please follow the instructions from there.
Download the PyCharm installer from here. Select the Community version, which is free and open source.
Install PyCharm by running the installer. Accept all the default values.
When finished, you can run PyCharm by clicking on the newly created icon 'Pycharm Community Edition' in the start menu.
Click on New Project in the welcome window or in the File menu at the top of the window (if no welcome window has appeared). From here on, the procedure is the same as from step 3 onward in the Linux part above. Please follow the instructions from there.
Git is a distributed version control system that allows developers to track code changes, collaborate with others, and revert to previous versions of their work.
GitHub is a web-based platform (similar to Bitbucket or GitLab) that provides hosting for Git repositories, as well as additional collaboration tools like issue tracking, code review, and continuous integration.
Git commands are often run in the command line or terminal. However, Git is nicely integrated with PyCharm, so you can use it within your IDE. The first step in using Git is to make sure it is already installed on the system. Git is available on Linux and macOS by default. You can check this by opening a terminal and running git --version. On Windows, you need to install Git from here. Download the installer for Windows, run it, and accept all the default values except for the terminal emulator, where it is better to choose the Windows default console window (like the image below). On Windows, when the Git installation is done, you may need to restart your PyCharm IDE.
After creating a new project in PyCharm, open the VCS menu and select 'Enable Version Control Integration'. In the next window, select 'Git'. After this, VCS in the top menu changes to Git (on Windows, this requires an IDE restart). Git has numerous functionalities; here we describe only the most basic ones: Commit, Push, and Clone.
The history of all changes made to the code is stored by Git. After making modifications to the code, you can Commit to save the progress. You may add all files or only specific files to a commit. When committing, it is necessary to provide a description that gives a clue about the changes you have made. These commits serve as checkpoints for developers, allowing them to revert to earlier stages in case of errors.
Click on Commit in the Git menu in PyCharm. In the newly opened window, select all 'Unversioned Files'. In the space provided below, write a message to remind you what has been changed in this commit (e.g., Initial commit, Deleted my_file.txt). Then hit the Commit button.
You can make some changes (adding, deleting, or changing some code) and commit each time. Then, in the terminal at the bottom of the PyCharm window, run git log to see the history of your commits. Each entry in the log has a commit_id (a hash code), with which you can restore your code to the state of that commit by running git checkout commit_id
Up to now, all tracking and saving of progress has been done locally. To publish the code to a remote folder (i.e., a repository), on GitHub for instance, you first need to open a GitHub account. This is a straightforward process: navigate to https://github.com/ and sign up.
Next, within the PyCharm project, click on the menu Git -> GitHub -> Share Project on GitHub
Provide a repository name and description. Click 'Add account' to connect your PyCharm to your GitHub account. Select 'Log In via GitHub' which automatically navigates you to the GitHub Login page. There you may Authorize JetBrains (PyCharm) to connect to your GitHub Account. Provide your credentials and Authorize JetBrains IDE integration.
After this step, PyCharm connects to your GitHub account and creates a new repository there. Now, you can click on Push in the Git menu, select your commit in the Push window, and hit the Push button. You can check your new repository on GitHub and see the changes that have been uploaded there.
If you find other people's repositories interesting and decide to work on their code, you can simply Clone their project by clicking on Clone under the Git menu. (It is important to acknowledge the contribution of others by citing their work if you have used all or part of their code.) This opens a window asking for the URL of the original repository and the local path where you want to copy the repository. PyCharm opens the project after it has copied the content. To learn more about Fork, Pull requests, branches, and many other Git functionalities check this document.
Python is a high-level language known for its readability and ease of use. Python uses indentation to define blocks of code, such as loops or functions. The indentation may be any number of spaces (most commonly 4), but it must be consistent within a block and is best kept consistent throughout the code.
Variables in Python are defined using the equal sign, e.g., my_variable = 5. They do not need a type to be specified. Python has several built-in data types, including numbers (integers and floats), strings, lists, dictionaries, and tuples. It supports the usual arithmetic operators (+, -, *, /, %), comparison operators (==, !=, <, >, <=, >=), and logical operators (and, or, not). Conditional statements in Python are defined using the keywords if, elif, and else. Python has a large standard library and also allows users to import external modules using the import statement. The following sections will provide an overview of all these concepts along with some examples.
# These blocks with the play button on the left are code cells. If you are working in CodaLab, you can run these cells or copy/paste the codes into your IDE to see the outcome.
x = 10 # assign 10 to variable x as an integer
print (x)
print (type(x))
y = 'python' # assign 'python' to variable y as a string
print (y)
x = 20 # reassign 20 to x
print(x)
Python supports several built-in data types, including:
bool: Boolean values, which can be either True or False.
x = True # assign True to x
y = False # assign False to y
print(type(x)) # print the type of x
print(x is True) # 'is' is an identity operator; it checks whether x is the very same object as True and returns a Boolean value
print (x and y) # logical AND is a binary operation, meaning it requires two operands, it returns True only if two operands are True, otherwise False
print (x or y) # logical OR is a binary operation, it returns False only if two operands are False, otherwise True
print (not x) # logical NOT is a unary operation, meaning it requires only one operand, it returns the opposite of its operand
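Values of other types can also be used where a Boolean is expected. The short sketch below (the example values are chosen only for illustration) uses the built-in bool() to show which values Python treats as False: zero, empty collections, the empty string, and None.

```python
# bool() shows how Python treats a value when it is used as a condition
values = [0, 1, 0.0, '', 'text', [], [0], {}, None]
for value in values:
    print(repr(value), '->', bool(value))

# 'and' and 'or' return one of their operands, not necessarily True or False
fallback = '' or 'default'      # 'or' returns the first truthy operand
shortcut = [] and 'unused'      # 'and' returns the first falsy operand
print(fallback)   # default
print(shortcut)   # []
```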
int: Integer values, such as 1, 2, 3, etc.
x = 10 # assign 10 to x
print(type(x)) # print the type of x
x = x + 1 # increment x by one
x +=1 # increment x by one, exactly as above
y = 20
x = y # reassign x with y
print(x)
print(y % x) # the modulo operator returns the remainder of a division
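Division deserves a closer look, because Python has two division operators. The sketch below, with numbers chosen only for illustration, contrasts true division (/), floor division (//), and the modulo operator (%).

```python
a, b = 7, 2
true_div = a / b        # true division always returns a float
floor_div = a // b      # floor division rounds down to a whole number
remainder = a % b       # remainder of the division
quotient, rest = divmod(a, b)  # both results in one call

print(true_div, floor_div, remainder)    # 3.5 3 1
print(type(true_div), type(floor_div))   # <class 'float'> <class 'int'>
print(quotient * b + rest == a)          # True

# Python integers have arbitrary precision, so large values do not overflow
print(2 ** 100)
```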
float: Floating-point numbers, such as 1.0, 2.5, 3.14, etc. To learn more about int, float, and other built-in number types in Python, please check here.
x = 2.4 # assign 2.4 to x
print(type(x)) # print the type of x
x = x + 1 # increment x by one
x += 1.3 # increment x by 1.3, using the same shorthand as for integers
y = 2
print(type(y))
x *= y # x = x * y multiply x with y and assign the result to x
print(type(x)) # float = float * int
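One practical consequence of floating-point numbers is that decimal fractions are stored only approximately. The sketch below shows the classic example and one common way to compare floats, math.isclose from the standard library.

```python
import math

total = 0.1 + 0.2
print(total)                      # 0.30000000000000004
print(total == 0.3)               # False: the stored values differ slightly
print(math.isclose(total, 0.3))   # True: equal within a small tolerance
print(round(total, 2))            # 0.3
```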
str: String values, such as "hello", "world", etc. Strings can be declared using single or double quotes. Long strings spanning multiple lines can be declared with triple quotes. When working in an IDE, typing a dot (.) after a variable name triggers the autocomplete functionality of the IDE, which shows the functions available for that particular object. You can find more about strings here.
x = 'sample string 1'
y = 'sample string "2"' # embed double quotes inside a single-quoted string
print(type(x)) # print the type of x
print (x + ' and ' + y) # adding two strings (concatenation)
print (x.capitalize()) # return a copy of x with its first character capitalized
print(len(x))
print(x.replace('sample', 'new'))
print(f'Python strings can be formatted in different ways like this <{x}> and this <{y}>.')
print('This is another way to format {0} and {1}.'.format(x, y))
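A few more string operations come up constantly in practice. The following sketch, using a made-up sentence, shows splitting, joining, slicing, membership testing, and number formatting inside an f-string.

```python
sentence = 'gene drug chemical'

words = sentence.split()        # split on whitespace into a list of words
joined = ', '.join(words)       # join the words with a separator
print(words)                    # ['gene', 'drug', 'chemical']
print(joined)                   # gene, drug, chemical

print(sentence.upper())         # GENE DRUG CHEMICAL
print(sentence[0:4])            # slicing works on strings too: gene
print('drug' in sentence)       # substring test: True

ratio = 2 / 3
formatted = f'{ratio:.2f}'      # format a float with two decimal places
print(formatted)                # 0.67
```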
list: Lists are ordered collections of values, which can be of any data type. For example: [1, 2, 3, "hello", [4, 5]]. Lists are mutable, which means that you can add items to them or change their items.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness'] # instantiate a list
print(type(list_of_bio_types))
list_of_bio_types.append('symptom') # append adds an item to the end of the list
print(list_of_bio_types)
list_of_bio_types.pop() # remove the last item
print(list_of_bio_types)
print(len(list_of_bio_types)) # length of list
print(list_of_bio_types[0]) # access the first item of a list; Python uses 0 indexing; index 0= first item, index 1=second item ...
print(list_of_bio_types[-1]) # access the last item of a list
print('-'*20)
# list slicing
print(list_of_bio_types[:2]) # items from the zero index to 2 (exclusive)
print(list_of_bio_types[1:3]) # items from the first index to 3 (exclusive)
print(list_of_bio_types[1:-2]) # items from index 1 up to, but not including, the second-to-last item
list_of_bio_types[:2] = ['cell', 'vitamin'] # replace multiple value
print(list_of_bio_types)
print('-'*20)
print(range(0, 10, 2)) # range is a built-in that produces a sequence of numbers; here from 0 up to (but not including) 10 with a step size of 2. The start and step parameters are optional.
print(list(range(0, 10, 2))) # The output of range function is an object of type range; to see the actual numbers it should be cast into list
print([x+100 for x in range(10)]) # list comprehension; an easy way to generate lists
print('-'*20)
# general indexing format: [begin index: end index: steps]; all three are optional
print(list_of_bio_types[2::2]) # from the second index to the last, every other item
print(list_of_bio_types[::-1]) # all list backward
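Because lists are mutable, it matters whether two names refer to the same list or to separate copies. The sketch below, with made-up variable names, illustrates the difference between assigning a list to a second name and copying it.

```python
original = ['gene', 'drug']
alias = original             # a second name for the SAME list object
clone = original.copy()      # a new list with the same items

alias.append('virus')        # also visible through 'original'
clone.append('illness')      # affects only the copy

print(original)              # ['gene', 'drug', 'virus']
print(clone)                 # ['gene', 'drug', 'illness']
print(alias is original)     # True
print(clone is original)     # False
```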
dict: Dictionaries are collections of key-value pairs, where each key is mapped to a value. Since Python 3.7, dictionaries preserve the order in which items were inserted. For example: {"key1": "value1", "key2": "value2"}.
sample_dictionary = dict() # instantiate a dictionary
sample_dictionary = {'1':'one', '2':'two', '3':'three', '4':'four', '5':'five'} # instantiate a dictionary
print(type(sample_dictionary))
sample_dictionary['6']= 'six' # add a new item or change the existing item of a dictionary
print(sample_dictionary)
print(len(sample_dictionary)) # length of dictionary
print(sample_dictionary['1']) # access the value of a key; if the key is not available, a KeyError is raised
print(sample_dictionary.get('10', None)) # access the value of a key, if the key is not available, the specified default value (None here) is returned
print({str(x):x for x in range(10)}) # dict comprehension, an easy way to generate dictionaries
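A typical use of a dictionary is counting. The sketch below, on a made-up list of words, combines get() with a default value, a membership test on keys, and pop() to remove an entry.

```python
words = ['gene', 'drug', 'gene', 'virus', 'gene']

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen
    counts[word] = counts.get(word, 0) + 1
print(counts)                  # {'gene': 3, 'drug': 1, 'virus': 1}

print('drug' in counts)        # 'in' tests the keys: True
removed = counts.pop('virus')  # remove a key and return its value
print(removed)                 # 1
print(list(counts))            # remaining keys in insertion order: ['gene', 'drug']
```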
tuple: Tuples are similar to lists, but they are immutable, meaning that their values cannot be changed once they are created. For example: (1, 2, 3, "hello", (4, 5)).
x = (4, 5, 4)
print(type(x))
print(x.index(5)) # get the index of an item
print(len(x))
x1, x2, x3 = x # one line assignment
print(x1, x2, x3)
print(x[1])
x[1] = 10 # raises TypeError and stops the program: tuple items cannot be reassigned
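To observe the error without stopping the program, the failing assignment can be wrapped in a try/except block (a construct not otherwise covered in this overview). The sketch also shows one way to obtain a "changed" tuple by building a new one; the variable names are made up.

```python
point = (4, 5, 4)

try:
    point[1] = 10            # not allowed: tuples are immutable
except TypeError as error:
    print('TypeError:', error)

# A new tuple can be built from pieces of the old one instead
updated = point[:1] + (10,) + point[2:]
print(updated)               # (4, 10, 4)
print(point)                 # the original is unchanged: (4, 5, 4)
print(point.count(4))        # number of occurrences of 4: 2
```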
set: Sets are unordered collections of unique values.
x = {4, 5, 6}
print(type(x))
x.add(6) # adding items that already exist does nothing
print(x)
x.remove(6)
print(x)
print(8 in x) # 'in' is a membership operator; it checks whether x contains 8 and returns a Boolean value accordingly
y = {5, 9, 10}
print(x.union(y))
print(x.intersection(y))
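Sets are often used to remove duplicates and to compare collections. The sketch below uses made-up data; sorted() is applied before printing because sets have no guaranteed order.

```python
types = ['gene', 'drug', 'gene', 'virus', 'drug']
unique = set(types)                 # duplicates are dropped
print(len(types), len(unique))      # 5 3
print(sorted(unique))               # ['drug', 'gene', 'virus']

a = {4, 5, 6}
b = {5, 9, 10}
difference = a - b                  # items in a but not in b
either = a ^ b                      # items in exactly one of the two sets
print(sorted(difference))           # [4, 6]
print(sorted(either))               # [4, 6, 9, 10]
print({5} <= a)                     # subset test: True
```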
These data types can be combined to create more complex data structures, such as lists of dictionaries or dictionaries of lists. Additionally, you can also create your own custom data types by defining classes and objects.
x = [(4, 5), (6, 7), (8, 9)] # list of tuples
x = [{4:'four', 5:'five'}, {6:'six', 7:'seven'}, {8:'eight', 9:'nine'}] # list of dictionaries
x = {(6, 7):'six_seven', (7, 8):'seven_eight'} # dictionary of tuples
x = {(6, 7):['six', 'seven'], (7, 8):['seven', 'eight']} # dictionary of tuples and lists: dictionary keys should be immutable (not changeable), so tuples can be keys but lists can not
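Accessing values in such nested structures is done by chaining the index or key lookups. The following sketch uses invented records (the names and tags carry no meaning) and also groups the records by type into a dictionary of lists.

```python
# Invented example data: a list of dictionaries, each holding a list
records = [
    {'name': 'item_a', 'type': 'drug', 'tags': ['x', 'y']},
    {'name': 'item_b', 'type': 'gene', 'tags': []},
    {'name': 'item_c', 'type': 'drug', 'tags': ['z']},
]

# Chain lookups: first record -> its 'tags' list -> second tag
print(records[0]['tags'][1])        # y

# Build a dictionary of lists: type -> names of that type
names_by_type = {}
for record in records:
    names_by_type.setdefault(record['type'], []).append(record['name'])
print(names_by_type)                # {'drug': ['item_a', 'item_c'], 'gene': ['item_b']}
```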
Operators are special symbols that perform specific operations on one or more operands. An operand is an object on which an operator operates.
There are several types of operators in Python:
Arithmetic Operators perform basic arithmetic operations like addition (+), subtraction (-), multiplication (*), exponentiation (**), modulo (%, the remainder of an integer division), etc.
Comparison Operators compare two values and return a Boolean value based on the comparison. For example: > for greater than, < for less than, == for equal to, etc.
Assignment Operators are used to assign values to variables. For example: = for assignment, += for addition and assignment, etc.
Logical Operators perform operations on Boolean values and return a Boolean value based on the evaluation of a logical expression. For example: and for logical AND, or for logical OR, not for logical NOT.
Membership Operators test for membership in a collection, such as a string, list, tuple, or set. For example: in for membership (see an example in the set examples above).
Identity Operators compare the memory locations of two objects and return True if they are the same object located at the same memory location, otherwise False. For example: is for identity checking (see an example in the bool examples above).
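The difference between equality (==) and identity (is) is easiest to see with two lists that hold the same items. A minimal sketch with made-up variable names:

```python
a = [1, 2, 3]
b = [1, 2, 3]    # a different list object with equal content
c = a            # another name for the same object as a

print(a == b)    # True: the contents are equal
print(a is b)    # False: two separate objects
print(a is c)    # True: the same object
print(id(a) == id(c))   # id() returns an object's identity: True

# 'is' is the usual way to compare with None
value = None
print(value is None)    # True
```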
Conditional statements allow you to control the flow of execution based on certain conditions. They allow you to check if a condition is true or false, and execute certain blocks of code based on the result.
# Single condition
condition = True
if condition: # The condition can be any expression that evaluates to a Boolean value, either True or False.
    print('condition is True')
x = 7
y = 7
if x > y:
    print('{0} is bigger than {1}'.format(x, y))
# Double conditions
if x > y:
    print('{0} is bigger than {1}'.format(x, y))
else:
    print('{0} is smaller than or equal to {1}'.format(x, y))
# More than two conditions
if x > y:
    print('{0} is bigger than {1}'.format(x, y))
elif x < y:
    print('{0} is smaller than {1}'.format(x, y))
else:
    print('{0} is equal to {1}'.format(x, y))
Loops are an important construct in programming that allow you to repeat a block of code multiple times. In Python, there are two types of loops: for loops and while loops.
for is used to iterate over a sequence of elements, such as a list, tuple, or string.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
for typ in list_of_bio_types: # iterating over the items of a list
    print(typ)
print('-'*30)
for idx, typ in enumerate(list_of_bio_types): # the built-in function enumerate returns each item's index along with the item
    print('Index: {}, type: {}'.format(idx, typ))
dictionary_of_bio_types = {idx: typ for idx, typ in enumerate(list_of_bio_types)} # dictionary comprehension is a concise way of generating dictionaries
print('-'*30)
for key in dictionary_of_bio_types.keys(): # iterating over dictionary keys (keys() returns a view of the dictionary's keys)
    print('Key: {}, value: {}'.format(key, dictionary_of_bio_types[key]))
print('-'*30)
for value in dictionary_of_bio_types.values(): # iterating over dictionary values (values() returns a view of the dictionary's values)
    print('Value: {}'.format(value))
print('-'*30)
for key, value in dictionary_of_bio_types.items(): # iterating over dictionary keys and values (items() returns a view of (key, value) tuples)
    print('Key: {}, value: {}'.format(key, value))
while loops are used to repeatedly execute a block of code as long as a certain condition is True.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
while list_of_bio_types: # a list with at least one item evaluates as True. Once the last item has been removed, the empty list evaluates as False and the while loop stops
    print(list_of_bio_types.pop()) # pop removes and returns the last item of the list
It is important to be careful when using while loops, as they can run forever if the condition never becomes False.
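Two common ways of keeping a while loop from running forever are sketched below: updating the variable that the condition depends on, and adding an upper limit on the number of iterations. The names and numbers are chosen only for illustration.

```python
# 1) The loop variable is updated inside the loop, so the condition eventually fails
count = 0
while count < 3:
    print('iteration', count)
    count += 1   # without this line the loop would never end

# 2) A safety limit stops the loop even if the main condition stays True
value = 100
steps = 0
max_steps = 10   # assumed upper bound for this example
while value > 1 and steps < max_steps:
    value = value // 2   # halve the value (floor division)
    steps += 1
print(value, steps)      # 1 6
```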
Loop control mechanisms in Python allow you to control the flow of execution within a loop. They provide a way to exit a loop prematurely, skip the current iteration, or do nothing. break, continue, and pass are some common loop control mechanisms in Python:
break is used to exit a loop prematurely. When a break statement is encountered within a loop, the loop is immediately terminated and the program continues with the next statement after the loop.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
for typ in list_of_bio_types:
    if typ == 'chemical':
        break
    print(typ)
continue is used to skip the current iteration of a loop and move on to the next iteration. When a continue statement is encountered within a loop, the program continues with the next iteration of the loop, bypassing any code in the current iteration that comes after the continue statement.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
for typ in list_of_bio_types:
    if typ == 'chemical':
        continue
    print(typ)
pass is a placeholder statement in Python that does nothing. It is used where the syntax requires a statement but you do not want any code to run. For example, you might use pass as a placeholder in a loop or an if statement that you haven't implemented yet.
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
for typ in list_of_bio_types:
    pass
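A frequent use of break is searching: the loop stops as soon as the wanted item is found. The sketch below also uses the optional else clause of a for loop, which runs only when the loop finishes without hitting break; the variable names are made up.

```python
list_of_bio_types = ['gene', 'drug', 'chemical', 'virus', 'illness']
target = 'virus'

found_at = None
for idx, typ in enumerate(list_of_bio_types):
    if typ == target:
        found_at = idx
        break                # stop searching once the item is found
else:
    # runs only if the loop was NOT ended by break
    print(target, 'is not in the list')

print(found_at)              # 3
```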
To read or write a file in Python, one must first use the built-in open() function to open it. This function returns a file object that can be used to call other related methods.
This is the basic syntax of the open function:
file_obj = open(file_name, access_mode)
file_name is an absolute or relative file path. access_mode is the mode in which the file is opened (read, write, append, etc.).
To read more about Files check this
file_obj = open("example.txt", "w") # open a file and write some lines in it. 'w' means 'write a text file', other common modes: wb for write binary, r, for read text, rb for read binary, ...
print ("Name of the file: ", file_obj.name)
file_obj.write("This is the first line in my first file.\n")
file_obj.write("This is the second line in my first file.")
file_obj.close() # it is best practice to close the file when you have finished working with it
with open("example.txt", "r") as file_obj: # with this structure, file_obj is closed automatically
    for line in file_obj.readlines(): # file objects have a readlines() method, which returns all lines as a list
        print(line.rstrip()) # rstrip() removes trailing whitespace, including the newline character at the end of each line
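The append mode ('a') adds to the end of an existing file instead of overwriting it. The sketch below writes to a temporary folder so that it leaves no files behind; the file name notes.txt is made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'notes.txt')

    with open(path, 'w', encoding='utf-8') as f:   # 'w' creates or overwrites the file
        f.write('first line\n')

    with open(path, 'a', encoding='utf-8') as f:   # 'a' appends to the end
        f.write('second line\n')

    with open(path, 'r', encoding='utf-8') as f:   # iterate over the file line by line
        lines = [line.rstrip('\n') for line in f]

print(lines)   # ['first line', 'second line']
```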
Functions in Python allow for organizing and reusing code, making it easier to write, test, and maintain large programs. Functions provide a way to encapsulate code blocks and perform specific tasks, allowing for modular and organized code. They also promote code reuse and reduce the risk of bugs by breaking down complex code into smaller, manageable pieces. Additionally, functions can make code easier to read and understand by giving descriptive names to code blocks and abstracting away implementation details.
Functions are defined using the def keyword, followed by the function name, a set of parentheses, and a colon. The code inside the function is indented, and the function is executed when it is called.
def my_function(input_arguments):
    """
    docstring: describe what this function does, what are possible inputs(maybe none) and what are possible outputs(maybe none)
    """
    # do stuff
    return 'if anything to return'

def compute_second_power(x): # function name as descriptive as possible
    # the string below is the docstring
    '''
    input: x an integer
    return: the second power of x
    '''
    result = x**2 # what function does
    return result # return the results

print(compute_second_power(x=2)) # call a function

def compute_power(x, power=2): # functions may have optional arguments. Optional arguments have a default value when the function is defined. When called, if they are not provided with a value, the default value is used
    '''
    inputs:
    x: an integer
    power: an integer
    return: x raised to the given power
    '''
    result = x**power
    return result

print(compute_power(x=2))
print(compute_power(x=2, power=3))

def compute_power(x:int, power:int=2)->int: # functions may be typed to help static type checkers and better readability
    '''
    inputs:
    x: an integer
    power: an integer
    return: x raised to the given power
    '''
    result = x**power
    return result

print(compute_power(x=2))
print(compute_power(x=2, power=3))
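A function can also return several values at once; Python packs them into a tuple, which the caller can unpack into separate variables. The function name min_max_mean below is made up for this illustration.

```python
def min_max_mean(values):
    """Return the smallest value, the largest value, and the mean of a non-empty list."""
    return min(values), max(values), sum(values) / len(values)

result = min_max_mean([2, 4, 9])
print(type(result))                 # <class 'tuple'>

lowest, highest, mean = result      # tuple unpacking
print(lowest, highest, mean)        # 2 9 5.0
```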
Modules are collections of functions, variables, and other objects that can be imported into other Python files. Modules provide a way to organize related functions and objects into a single, importable file. They help to separate code into different, reusable components and reduce the risk of naming conflicts between different parts of a program. You can write several functions in one file and import that file in another file, where the functions defined in the imported file can then simply be called.
Libraries are collections of modules that provide additional functionality for a programming language. In Python, libraries are usually distributed as packages and can be installed using package managers like pip (or conda as you did earlier). They provide a wide range of functionalities, including scientific computing, data analysis, machine learning, web development, and more. For example:
from numpy.linalg import svd # this line loads svd (Singular Value Decomposition) as a function from the linalg (linear algebra) module within the numpy library
The sequential structure (e.g., running lines of code one after another), the conditional structure (e.g., if/else), the repetitive structure (e.g., loops), and the function structure (e.g., def) covered in this overview are four fundamental building blocks of any software program. They can be combined in various ways to solve different problems and meet specific requirements. There is one other building block, the class construct in Object-Oriented Programming (OOP), which will be covered later in more advanced modules in your studies.
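The four structures can be combined in a few lines. The function below is a made-up illustration (the name classify_numbers is not from any library) that uses all of them at once.

```python
def classify_numbers(numbers):       # function structure
    """Label each number in a list as 'even' or 'odd'."""
    labels = []                      # sequential structure: statements run in order
    for number in numbers:           # repetitive structure
        if number % 2 == 0:          # conditional structure
            labels.append('even')
        else:
            labels.append('odd')
    return labels

labels = classify_numbers([1, 2, 3, 4])
print(labels)  # ['odd', 'even', 'odd', 'even']
```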
NumPy and Pandas, two popular Python libraries, are introduced here.
NumPy (Numerical Python) is a Python library for numerical computing, providing powerful tools for working with multi-dimensional arrays and matrices. It includes functions for mathematical operations, random number generation, linear algebra, Fourier analysis, and more. NumPy is widely used in scientific computing, data analysis, machine learning, and other fields where high-performance numerical operations are required. To work with Numpy, let's first install the library:
pip install numpy
Before working with Numpy, let's review some basic and useful concepts in linear algebra, beginning with the scalar, vector, matrix, and tensor data structures:
Scalar: A single numerical value (rank 0), e.g., 2.0
Vector: An array of numbers (rank 1), e.g., [2.0, 3.0, 4.0]
Matrix: An array of numbers arranged in rows and columns (rank 2), e.g., a 2-row by 3-column (i.e., 2 x 3) matrix of zeros, [[0, 0, 0], [0, 0, 0]]
Tensor: A generalization of a matrix with an arbitrary rank, e.g., rank-1 tensor (or vector), rank-2 tensor (or matrix), rank-3 tensor, ...
import numpy as np # making sure that numpy is installed and imported properly; 'as' gives the library a short nickname, which is convenient because the name is repeated many times
scalar_variable = 2.0 # scalar value
vector = np.array([1, 2, 3, 4]) # creating a one-dimensional array (i.e., vector) from a list; all items of an array have the same type (in contrast to a Python list)
print('type of array: ', type(vector)) # type of array
print('data type of array: ', vector.dtype) # data type of array
print('shape of an array: ', vector.shape) # shape of an array is a tuple with sizes of each dimension, number of dimensions is the 'rank' of an array; here rank is one and the size of the only dimension is 4
print('second item of the array: ', vector[1]) # second item of the array
print('-'*30)
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]]) # creating a two-dimensional array (i.e., matrix) from a list of lists
print('number of dimensions: ', matrix.ndim) # number of dimensions
print('shape: ', matrix.shape)
print(matrix[1])
print('-'*30)
tensor = np.array([[[1, 2, 3, 4]], [[5, 6, 7, 8]]]) # creating a three-dimensional array from nested lists
print('shape: ', tensor.shape)
print(tensor[1])
print('-'*30)
vector = np.linspace(1, 5, 10) # constructing an array with 10 items equally spaced between 1 and 5 (both endpoints included)
print(vector)
print('-'*30)
array = [[1, 2, 3, 4]]
repeated_array = np.repeat(array, repeats=5, axis=0) # repeating the array, 5 times as rows (axis=0)
print('shape', repeated_array.shape)
print(repeated_array)
Integers (int) and floating-point numbers (float) in different precisions (8, 16, 32, or 64 bits for integers; 16, 32, or 64 bits for floats) are the two most common data types in Numpy arrays. Unsigned integers (uint), bool, and complex are some other data types. Numpy arrays can have their data types explicitly declared or implicitly inferred.
import numpy as np
array = np.array([1, 2, 3, 4]) # inferred implicitly
print(array.dtype)
print('-'*30)
array = np.array([1.0, 2, 3, 4]) # inferred implicitly (only one float in an array changes the data type of the entire array to float)
print(array.dtype)
print('-'*30)
array = np.array([1, 2, 3, 4], dtype=np.int64) # declared explicitly
print(array.dtype)
print('-'*30)
array = array.astype(np.float32) # change dtype of an array
print(array.dtype)
print('-'*30)
Slicing a Numpy array does not copy its data: the result is a view that points to the memory of the original array (i.e., changing the slice changes the original array as well), unless the copy() method is used explicitly.
import numpy as np
array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
sub_array = array[:, 2:3]
print(sub_array)
sub_array[0][0] = 10
print(sub_array)
print(array) # change a value in subarray changes the original array as well
print('-'*30)
array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
sub_array = array[:, 2:3].copy() # making a separate copy
print(sub_array)
sub_array[0][0] = 10
print(sub_array)
print(array)
print('-'*30)
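Whether two arrays share memory can also be checked directly instead of being inferred from a changed value. A short sketch using np.shares_memory and the base attribute:

```python
import numpy as np

array = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
view = array[:, 2:3]           # basic slicing returns a view
copied = array[:, 2:3].copy()  # copy() allocates new memory

shares_with_view = np.shares_memory(array, view)
shares_with_copy = np.shares_memory(array, copied)
print(shares_with_view)    # True
print(shares_with_copy)    # False
print(view.base is array)  # True: the view is backed by `array`
```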
In the following block, several specific types of matrices are introduced.
Zero/one/full matrix: a matrix of any size in which all elements are the same number: zero, one, or another predefined value.
Identity matrix (denoted with I): a square matrix (i.e., one with the same number of rows and columns) that has 1's along the main diagonal and 0's elsewhere. An identity matrix plays the role of 1 in ordinary arithmetic; multiplying any matrix by an identity matrix of compatible size yields the same matrix.
Inverse matrix: the inverse of a square matrix (denoted with a superscript -1) is another square matrix that, when multiplied by the original matrix, yields the identity matrix of the same size. Note: a matrix is called singular if it has no inverse.
import numpy as np
np.set_printoptions(suppress=True) # print floats in fixed-point rather than scientific notation, so very small numbers are displayed as 0
zero_array = np.zeros(shape = (5, 2)) # 5 by 2 matrix of zeros
print('zero array ', zero_array)
print('-'*30)
one_array = np.ones(shape = (5, 2)) # 5 by 2 matrix of ones
print('one array ', one_array)
print('-'*30)
full_array = np.full(shape = (2, 2), fill_value=5) # 2 by 2 matrix of 5s
print('full array ', full_array)
print('-'*30)
full_array = np.full_like(zero_array, fill_value=5) # gets the shape of zero_array, but fills with 5
print('full array ', full_array)
print('-'*30)
identity_array = np.eye(N = 2) # 2 by 2 identity matrix
print('identity array ', identity_array)
print('-'*30)
random_array = np.random.random((2, 3)) # 2 by 3 matrix of random floats drawn uniformly from [0, 1)
print('random array ', random_array)
print('-'*30)
random_array = np.random.randint(2, 10, (3, 3)) # 3 by 3 matrix of random integers from 2 to 9 (the upper bound 10 is excluded)
print('random array ', random_array)
print('-'*30)
inverse_of_random_array = np.linalg.inv(random_array) # raises numpy.linalg.LinAlgError if the random matrix happens to be singular
print('inverse array ', inverse_of_random_array)
print()
print(np.dot(random_array, inverse_of_random_array)) # an array multiplied by its inverse yields the identity matrix, up to floating-point rounding (more on matrix multiplication in the next block)
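Because floating-point results are rarely exactly 0 or 1, a comparison with np.allclose is a more reliable check than reading the printed matrix. The sketch below uses two small hand-picked matrices (chosen for illustration, not taken from the block above) to show both a successful inversion and the error raised for a singular matrix.

```python
import numpy as np

invertible = np.array([[2.0, 1.0], [1.0, 3.0]])   # determinant is 5, so an inverse exists
singular = np.array([[1.0, 2.0], [2.0, 4.0]])     # second row is twice the first

inverse = np.linalg.inv(invertible)
is_identity = np.allclose(invertible @ inverse, np.eye(2))  # compare with a tolerance
print(is_identity)  # True

try:
    np.linalg.inv(singular)
    singular_detected = False
except np.linalg.LinAlgError:
    singular_detected = True
print(singular_detected)
```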
Numpy comes with a wide range of arithmetic and mathematical operations:
Element-wise operations: vector-vector or matrix-matrix addition, subtraction, multiplication, and division are possible if both operands have the same shape (or shapes that Numpy can broadcast to a common shape).
import numpy as np
vector = np.array([1, 2, 3, 4], dtype=np.float32) # use dtype argument to explicitly use a particular data type
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32)
print('-'*20,' vector ', '-'*20)
print(vector)
print()
print('-'*20,' matrix ', '-'*20)
print(matrix)
print()
print('-'*20,' matrix matrix addition ', '-'*20)
print(np.add(matrix, matrix)) # element-wise addition, equivalent to matrix + matrix with overloaded +
print()
print('-'*20,' matrix vector addition ', '-'*20)
print(np.add(matrix, vector)) # strictly, element-wise addition requires both operands to have the same shape; Numpy uses broadcasting here, meaning that it behaves as if the vector were stacked into a matrix of matching shape on the fly
print()
print('-'*20,' element-wise multiplication ', '-'*20)
print(np.multiply(matrix, matrix)) # element-wise multiplication (not matrix multiplication), equivalent to matrix * matrix with overloaded *
print()
print('-'*20,' elementwise square root ', '-'*20)
print(np.sqrt(matrix)) # element-wise square root
print()
print('-'*20,' matrix max ', '-'*20)
print(np.max(matrix, axis=0)) # max of an array; axis=0 gives the max of each column, axis=1 the max of each row, and no axis the global max
print()
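Broadcasting only works when the shapes are compatible: dimensions are compared from the last one backwards, and each pair must be equal or contain a 1. A small sketch with made-up values shows one case that works directly, one that fails, and how adding an axis resolves it.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])   # shape (2, 4)
row = np.array([10, 20, 30, 40])                  # shape (4,)
column = np.array([100, 200])                     # shape (2,)

row_sum = matrix + row                 # (2, 4) + (4,): row is added to every row
column_sum = matrix + column[:, None]  # (2, 4) + (2, 1): one value added per row

try:
    matrix + column                    # (2, 4) + (2,): trailing sizes 4 and 2 differ
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(row_sum)
print(column_sum)
print(shapes_compatible)  # False
```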
Some other operations:
Inner/dot product: the sum over the element-wise multiplication of two vectors of the same length. It yields a scalar.
Matrix multiplication: assume A, B, and C are matrices. To compute the product C = A x B, the number of columns in A must equal the number of rows in B. The product (a matrix) has the same number of rows as A and the same number of columns as B:
C(m x n) = A(m x k) x B(k x n)
Transposition (denoted with a small superscript T): switching the row and column indices of a matrix. In matrix multiplication, we sometimes need to transpose one matrix to satisfy the condition for matrix multiplication described above.
import numpy as np
vector = np.array([1, 2, 3, 4], dtype=np.float32) # use dtype argument to explicitly use a particular data type
matrix = np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32)
print('-'*20,' inner product ', '-'*20)
print(np.dot(vector, vector)) # inner product of two vectors
print()
print('-'*20,' matrix vector multiplication ', '-'*20)
print(np.dot(matrix, vector)) # matrix vector multiplication, yields a vector
print()
print('-'*20,' matrix matrix multiplication ', '-'*20)
print(np.dot(matrix, np.transpose(matrix))) # matrix-matrix multiplication, yields a matrix; the second matrix is transposed first so that the shapes (2 x 4) and (4 x 2) are compatible. Other shape manipulations are possible with the reshape function
print()
print('matrix shape', matrix.shape)
matrix_reshaped = matrix.reshape((4, 2))
print('matrix new shape', matrix_reshaped.shape)
print()
print('-'*20,' matrix matrix multiplication ', '-'*20)
print(np.matmul(matrix, np.transpose(matrix))) # another way to multiply two matrices
print()
print('-'*20,' identity matrix determinant ', '-'*20)
print(np.linalg.det(np.eye(3))) # computing the determinant of an identity matrix
print()
print('-'*20,' stacking vertically ', '-'*20)
print(np.vstack([vector, vector, vector, vector])) # stacking vectors or matrices vertically
print()
print('-'*20,' stacking horizontally ', '-'*20)
print(np.hstack([vector, vector, vector])) # stacking vectors or matrices horizontally
print()
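The shape rule C(m x n) = A(m x k) x B(k x n) can be checked on small arrays. The sketch below uses the @ operator, which is equivalent to np.matmul for two-dimensional arrays; the arrays A and B are arbitrary illustrative values.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)     # shape (2, 3): m=2, k=3
B = np.arange(12).reshape(3, 4)    # shape (3, 4): k=3, n=4
C = A @ B                          # matrix multiplication
print(C.shape)                     # (2, 4)

# Each entry of C is the dot product of a row of A with a column of B.
entry = np.dot(A[1, :], B[:, 2])
print(entry == C[1, 2])            # True

# Transposition swaps the two axes.
print(A.T.shape)                   # (3, 2)
```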
Numpy provides several ways to index arrays, such as slicing, integer indexing and Boolean indexing.
import numpy as np
rank_2_array_1 = np.array([[1, 2, 3, 4], [1, 2, 3, 4]])
print(rank_2_array_1.shape)
rank_2_array_2 = rank_2_array_1[:2, -2:] # slicing with one slice per dimension: the first two rows and the last two columns
print(rank_2_array_2.shape)
print(rank_2_array_2)
import numpy as np
rank_2_array_1 = np.array([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6, 7.7, 8.8]])
print(rank_2_array_1.shape)
rank_2_array_2 = rank_2_array_1[[1, 0, 0, 1], [3, 1, 0, 2]] # integer indexing: [[row indices], [column indices]] are paired up element by element; e.g., 8.8 is at row index 1, column index 3
print(rank_2_array_2.shape)
print(rank_2_array_2)
Note: Integer indexing and slicing can be combined to construct lower ranked arrays:
import numpy as np
rank_2_array_1 = np.array([[1.1, 2.2, 3.3, 4.4], [5.5, 6.6, 7.7, 8.8]])
print(rank_2_array_1.shape)
rank_1_array_1 = rank_2_array_1[1, :] # gets all columns from the second row, so shape would be (4, )
print(rank_1_array_1.shape)
rank_1_array_2 = rank_2_array_1[:, 1] # gets all rows from the second column, so shape would be (2, )
print(rank_1_array_2.shape)
import numpy as np
distribution_1 = np.array([0.1, 0.49,0.87,0.59,0.52,0.42,0.34,0.7])
print(distribution_1.shape)
print('-'*30)
distribution_2 = distribution_1>0.5 # checks the condition for each item and returns an array of the same size with Boolean values
print(distribution_2.shape)
print(distribution_2)
print('-'*30)
distribution_3 = distribution_1[distribution_1>0.5] # use the Boolean array as an index; returns the values at the positions where it is True
print(distribution_3.shape)
print(distribution_3)
print('-'*30)
distribution_4 = distribution_1[(distribution_1>0.5) & (distribution_1<0.6)] # Boolean conditions can be complex
print(distribution_4.shape)
print(distribution_4)
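A Boolean mask can be used for more than selection. As a brief sketch on the same numbers, it can also count the matching items and assign new values to them; the variable names here are illustrative.

```python
import numpy as np

values = np.array([0.1, 0.49, 0.87, 0.59, 0.52, 0.42, 0.34, 0.7])
mask = values > 0.5

count_above = mask.sum()     # True counts as 1, so the sum is the number of matches
print(count_above)           # 4

capped = values.copy()       # work on a copy so `values` stays unchanged
capped[mask] = 0.5           # assign through the mask
print(capped)
```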
To wrap up this section we do some image manipulations using numpy. Before that, we should install two other libraries:
pip install scikit-image
Scikit-image is a collection of algorithms for image processing. We use it to read an image into a three-dimensional array (number of pixels in height, number of pixels in width, 3 values for Red, Green, and Blue). See the scikit-image documentation for more information.
pip install matplotlib
For visualization we use matplotlib. See the matplotlib documentation for more information.
import numpy as np # needed for np.where below
from skimage import io # import the io module from the skimage library
import matplotlib.pyplot as plt # import the pyplot module
image_matrix = io.imread('https://drive.google.com/uc?id=1Bsmk_7b4dBJPr1y-bQfLttYQPiMs-q_o') # call the imread function to get the image matrix. An RGB image is a 3-dimensional (height, width, colors) matrix with values between 0 and 255 for each color
print(image_matrix.shape)
plt.imshow(image_matrix) # visualize the image
plt.imshow(image_matrix[100:500, 100:400]) # slicing
plt.imshow(image_matrix[:,:,::-1]) # all rows, all columns, but the order of the color channels reversed
plt.imshow(image_matrix[::-1,:,:]) # all columns and colors, but rows reversed (the image is flipped upside down)
plt.imshow(image_matrix[::2,:,:]) # all columns and colors, but only every other row
plt.imshow(image_matrix + 10) # element-wise addition; if the image is stored as 8-bit unsigned integers, values above 245 wrap around instead of saturating at 255
plt.imshow(np.where(image_matrix>200, 255, 0)) # np.where with three arguments chooses element by element: values greater than 200 are replaced with 255 (white), all others with 0 (black)
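Images are commonly stored as 8-bit unsigned integers (uint8), which can only hold values from 0 to 255, so brightening by plain addition can wrap around. The sketch below uses three made-up channel values rather than the downloaded image and shows one possible way to avoid the wrap-around with np.clip.

```python
import numpy as np

pixels = np.array([0, 100, 250], dtype=np.uint8)   # made-up channel values

wrapped = pixels + np.uint8(10)   # 250 + 10 = 260 does not fit in uint8 and wraps to 4
print(wrapped)

# Widen the type, add, clip to the valid range, then convert back to uint8.
brightened = np.clip(pixels.astype(np.int16) + 10, 0, 255).astype(np.uint8)
print(brightened)
```

The same pattern applies to a full image array, since all of these operations are element-wise.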
Pandas is built on top of the NumPy library and provides high-level data structures and functions to manipulate data. The two main data structures in Pandas are the Series and the DataFrame.
A Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, or even Python objects.
import numpy as np
import pandas as pd
series_1 = pd.Series(np.linspace(0, 10, 5), name = 'random_series') # instantiate a new series
print('series_1 shape ', series_1.shape)
print()
print('series_1 head \n', series_1.head(3)) # get the first few items in a series, head(n) to show n first items
print()
print('series_1 tail \n', series_1.tail(3)) # get the last few items in a series, tail(n) to show n last items
print()
print ('series_1 max ', np.max(series_1)) # apply mathematical operation on series
print()
print ('series_1 statistical description \n', series_1.describe()) # basic descriptive statistics on the values
print()
series_1.index = ['first', 'second', 'third', 'fourth', 'fifth'] # setting the index manually
print()
print('series_1 with index \n', series_1.head())
print()
print(series_1['fifth'])
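One consequence of a Series being labeled is that arithmetic between two Series matches values by label rather than by position. A minimal sketch with made-up data:

```python
import pandas as pd

prices = pd.Series({'apple': 1.5, 'pear': 2.0, 'plum': 3.0})   # made-up values
quantities = pd.Series({'plum': 2, 'apple': 4})                 # different order, no 'pear'

totals = prices * quantities   # values are matched by label, not by position
print(totals)                  # 'pear' has no partner, so its result is NaN
```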
A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. It can be thought of as a spreadsheet.
import pandas as pd
# pandas has many functions for reading data into a dataframe, including read_csv, read_excel, and read_pickle.
# Here we use read_csv. CSV (comma-separated values) files are text files where each line is a row in which columns are (often) separated with a comma.
# If columns are separated with something else, like a tab ('\t'), the argument delimiter='\t' should be added to the read_csv call
df = pd.read_csv('https://query1.finance.yahoo.com/v7/finance/download/GOOG?period1=1582781719&period2=1614404119&interval=1d&events=history&includeAdjustedClose=true')
print(df.shape)
print(df.head())
new_df = df.iloc[:10] # use 'iloc' to get only the first 10 rows
print(list(new_df)) #list of columns
print(new_df.columns.values) #list of columns
print('-'*30)
print(new_df['Open']) # get only one column. new_df.Open is another way to do so as long as the column names do not contain space
print(new_df.Open)
print('-'*30)
print(new_df[['Close', 'Open']]) # particular columns in particular order
print('-'*30)
# dataframe indexing
print(new_df.iloc[:4, :2]) # iloc for integer indexing; only first 4 rows and first two columns
print('-'*30)
print(new_df.loc[new_df.index[1:5], ['Open', 'Close']] ) # loc for label-based indexing: rows by index label, columns by name
print('-'*30)
print(new_df.loc[(new_df.Open >66) & (new_df.Open<68)]) # loc for Boolean search
print('-'*30)
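new_df = new_df[['Open', 'High', 'Low', 'Close']].copy() # select only the useful columns; copy() makes new_df independent of df, so columns can be added later without pandas warning about modifying a slice. Another way is to drop the undesired columns: new_df.drop(columns=['Date', 'Adj Close', 'Volume'])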
print(new_df)
print('-'*30)
# new_df = new_df.drop(columns=['Date', 'Adj Close', 'Volume']) # same output as above
# print(new_df)
# print('-'*30)
new_df.columns = 'O', 'H', 'L', 'C' # changing column names
print(new_df)
print('-'*30)
new_df['O_C'] = new_df['O'] - new_df['C'] # add new column
print(new_df)
print('-'*30)
print(new_df.describe()) # basic descriptive statistics for the dataframe
print('-'*30)
new_df = new_df.sort_values('O') # sort dataframe rows based on Open values
print(new_df)
print('-'*30)
new_df.to_csv('updated_data.csv', index=False) # after all the changes made to the data, save the result so that the cleaned data is available next time. index=False skips writing the index to the file, since it is not needed
import pandas as pd
# We can continue by loading the file that we saved earlier
new_df = pd.read_csv('updated_data.csv')
print(new_df)
print('-'*30)
new_df.reset_index(drop=True, inplace=True) # resetting index without creating new dataframe (inplace=True)
print(new_df)
print('-'*30)
for index, row in new_df.iterrows(): # iterating over rows
    print(index, row.O)
print('-'*30)
new_df.plot(kind = 'scatter', x = 'O', y = 'C') # plotting directly from pandas (uses matplotlib under the hood)
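The same selection steps can be tried without downloading anything. The sketch below builds a small dataframe from a dictionary; the dates and prices are made up for illustration and are not real market data.

```python
import pandas as pd

# Made-up prices, not real market data.
df = pd.DataFrame({
    'Date': ['2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07'],
    'Open': [66.5, 67.2, 68.1, 67.8],
    'Close': [67.0, 66.9, 68.4, 67.1],
})

first_rows = df.iloc[:2, :2]                        # by position: first 2 rows, first 2 columns
in_range = df.loc[(df.Open > 67) & (df.Open < 68)]  # Boolean selection with loc
df['O_C'] = df['Open'] - df['Close']                # derived column

print(first_rows)
print(in_range)
print(df)
```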