LINUX Kernel Chapter 3 Introduction to the Kernel 黃仁竑

© 黃仁竑 / 中正資工

Processes and Tasks

Processes, seen from outside: individual processes exist independently

Tasks, seen from inside: only one operating system is running

[Figure: system kernel with co-routines (Task 1, Task 2, Task 3) and the corresponding Process 1, Process 2, Process 3]


Process States

[Figure: process state diagram. Nodes: Running, Interrupt routine, System call, Return from system call, Ready, Waiting. Edge labels: Interrupt, Scheduler. The diagram is divided into user mode and system mode.]


Process States

Running
  The task is active and running in the non-privileged user mode.
  If an interrupt or system call occurs, it is switched to the privileged system mode.

Interrupt routine
  The hardware signals an exception condition.
  Example: the clock generates a signal every 10 ms.

System call
  Entered through a software interrupt.


Process States

Waiting
  The process waits for an external event (e.g., I/O completion).

Return from system call
  Entered when a system call or interrupt is complete.
  The scheduler switches the process to the ready state.

Ready
  The process is competing for the processor.


Important Data Structures

Task structure
  task_struct in include/linux/sched.h
  It is also accessed by assembly code, so the order of the leading fields cannot be altered and no declarations can be added at the front.

States
  TASK_RUNNING (0): ready or running
  TASK_INTERRUPTIBLE (1), TASK_UNINTERRUPTIBLE (2): waiting for certain events. TASK_UNINTERRUPTIBLE means that the task cannot be woken up by signals.
  TASK_ZOMBIE (3): the process has terminated but still has its task structure
  TASK_STOPPED (4): the process has been halted
  TASK_SWAPPING (5): not used


Task Structure

struct task_struct {
/* these are hardcoded - don't touch */
volatile long state;
  volatile indicates that this value can be altered by interrupt routines
long counter;
long priority;
  counter holds the time, in ticks, that the process can still run before a mandatory scheduling action is carried out. The scheduler uses counter as the dynamic priority.
  priority holds the static priority of the process


Task Structure

unsigned long signal;
unsigned long blocked;
  signal contains a bit mask of the signals received for the process. It is evaluated in the routine ret_from_sys_call(), which is called after every system call and after slow interrupts.
  blocked contains a bit mask of the signals to be blocked
unsigned long flags;
  flags contains the combination of the system status flags
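The two bit masks can be illustrated with a small user-space sketch. It assumes that signal number sig maps to bit (sig - 1), which matches the action[sig-1] indexing used later in these slides; the struct and helper names (task_sketch, sig_bit, post_signal, deliverable) are hypothetical and are not kernel code.

```c
#include <assert.h>

/* Simplified stand-in for the two task_struct fields discussed above. */
struct task_sketch {
    unsigned long signal;   /* bit (sig-1) set: signal sig has been received */
    unsigned long blocked;  /* bit (sig-1) set: signal sig is blocked */
};

/* Bit used for signal number sig (1..32); the mapping is an assumption. */
static unsigned long sig_bit(int sig)
{
    assert(sig >= 1 && sig <= 32);
    return 1UL << (sig - 1);
}

/* Record that sig has been received for the task. */
static void post_signal(struct task_sketch *t, int sig)
{
    t->signal |= sig_bit(sig);
}

/* Signals that have been received and are not currently blocked. */
static unsigned long deliverable(const struct task_sketch *t)
{
    return t->signal & ~t->blocked;
}
```

A blocked signal stays recorded in signal; it only drops out of the deliverable set until its bit in blocked is cleared.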


Task Structure

Process flags:
#define PF_ALIGNWARN  0x00000001 /* Print alignment warning msgs */
                                 /* Not implemented yet, only for 486 */
#define PF_PTRACED    0x00000010 /* set if ptrace (0) has been called. */
#define PF_TRACESYS   0x00000020 /* tracing system calls */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV  0x00000100 /* used super-user privileges */
#define PF_DUMPCORE   0x00000200 /* dumped core */
#define PF_SIGNALED   0x00000400 /* killed by a signal */
#define PF_STARTING   0x00000002 /* being created */
#define PF_EXITING    0x00000004 /* getting shut down */
#define PF_USEDFPU    0x00100000 /* Process used the FPU this quantum (SMP only) */
#define PF_DTRACE     0x00200000 /* delayed trace (used on m68k) */
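Because each flag occupies its own bit, several flags can be combined in the single flags word. The following is a minimal sketch of setting, clearing, and testing them; the PF_ values are copied from the list above, while the helper functions set_flag, clear_flag, and has_flag are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define PF_STARTING   0x00000002 /* being created */
#define PF_EXITING    0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV  0x00000100 /* used super-user privileges */

/* Return flags with bit f turned on. */
static unsigned long set_flag(unsigned long flags, unsigned long f)
{
    return flags | f;
}

/* Return flags with bit f turned off. */
static unsigned long clear_flag(unsigned long flags, unsigned long f)
{
    return flags & ~f;
}

/* Non-zero if bit f is set in flags. */
static int has_flag(unsigned long flags, unsigned long f)
{
    return (flags & f) != 0;
}
```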


Task Structure

int errno;
int debugreg[8];
  errno holds the error code of the last failed system call.
  debugreg contains the 80x86's debugging registers.
struct exec_domain *exec_domain;
  which UNIX is emulated for the process
struct task_struct *next_task, *prev_task;
  all processes are linked through these two pointers
  init_task is the start and the end of this list
struct task_struct *next_run, *prev_run;
  list of processes that apply for the processor


Task Structure

struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
  pointers to the original parent, the parent, the youngest child, the younger sibling, and the older sibling, respectively

[Figure: process family relations among a parent, its youngest child, a child, and its oldest child. Arrow labels: p_pptr, p_cptr, p_ysptr, p_osptr.]
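How the family pointers fit together can be sketched in user space as follows. The struct keeps only the fields named above, and add_child and count_children are hypothetical helpers; the sketch assumes that a new child becomes the youngest one and that older siblings are reached by following p_osptr from p_cptr.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for task_struct with only the family pointers. */
struct task_sketch {
    int pid;
    struct task_sketch *p_pptr;   /* parent */
    struct task_sketch *p_cptr;   /* youngest child */
    struct task_sketch *p_ysptr;  /* younger sibling */
    struct task_sketch *p_osptr;  /* older sibling */
};

/* Link child as the new youngest child of parent. */
static void add_child(struct task_sketch *parent, struct task_sketch *child)
{
    child->p_pptr = parent;
    child->p_ysptr = NULL;
    child->p_osptr = parent->p_cptr;      /* previous youngest is now older */
    if (child->p_osptr != NULL)
        child->p_osptr->p_ysptr = child;
    parent->p_cptr = child;
}

/* Walk from the youngest child towards the oldest and count them. */
static int count_children(const struct task_sketch *parent)
{
    const struct task_sketch *c;
    int n = 0;

    for (c = parent->p_cptr; c != NULL; c = c->p_osptr)
        n++;
    return n;
}
```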


Task Structure

struct mm_struct *mm;
  memory management information
struct mm_struct {
int count;
pgd_t * pgd;
unsigned long context;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack, start_mmap;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
struct vm_area_struct * mmap;
struct vm_area_struct * mmap_avl;
struct semaphore mmap_sem;
};


Virtual Memory


Task Structure

unsigned long kernel_stack_page;
  stack used when the process is running in system mode
unsigned long saved_kernel_stack;
  saves the old stack pointer when running the MS-DOS emulator (vm86)
int pid, pgrp, session, leader;
  process ID, process group ID, the session the process belongs to, and the session leader flag
unsigned short uid, euid, suid, fsuid;
unsigned short gid, egid, sgid, fsgid;
  user ID, effective user ID, saved user ID, file system user ID
  group ID, effective group ID, saved group ID, file system group ID


Task Structure

uid, euid, suid, gid, egid, sgid
  Each process has a real user ID and group ID and an effective user ID and group ID.
  The real ID identifies the person using the system.
  The effective ID determines their access privileges.
  execve() changes the effective user or group ID to the owner or group of the executed file if the file has the set-user-ID (suid) or set-group-ID (sgid) mode bit set. The real UID and GID are not affected.
  The effective user ID and effective group ID of the new process image are saved as the saved set-user-ID and saved set-group-ID respectively, for use by setuid(3V).
  Turn on suid: chmod u+s filename (chmod a+s also turns on sgid)


Task Structure

uid and gid are inherited from the parent
euid, egid, fsuid, fsgid can be set at run time (owner of the executable file)
int groups[NGROUPS];
  a process may be assigned to many groups
struct fs_struct *fs;
  file system information
struct fs_struct {
int count; /* for future expansions */
unsigned short umask; /* access mode */
struct inode * root, * pwd; /* root dir and current dir */
};


Task Structure

struct files_struct *files;
  open file information (file descriptors)
struct files_struct { /* open file table structure */
int count;
fd_set close_on_exec; /* files to be closed when exec is issued */
fd_set open_fds; /* open files (bitmask) */
struct file * fd[NR_OPEN];
};
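The interplay between the open_fds bit mask and the fd[] array can be sketched as below. This is a user-space simplification under stated assumptions: the bit masks are plain unsigned long values rather than fd_set, the table size NR_OPEN_SKETCH is made up, and install_fd and release_fd are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define NR_OPEN_SKETCH 32            /* hypothetical table size for this sketch */

struct file_sketch { int dummy; };   /* stand-in for struct file */

struct files_sketch {
    unsigned long close_on_exec;              /* bit n set: close fd n on exec */
    unsigned long open_fds;                   /* bit n set: fd n is in use */
    struct file_sketch *fd[NR_OPEN_SKETCH];   /* fd n -> open file */
};

/* Install f in the lowest free slot; return the descriptor, or -1 if full. */
static int install_fd(struct files_sketch *files, struct file_sketch *f)
{
    int n;

    for (n = 0; n < NR_OPEN_SKETCH; n++) {
        if (!(files->open_fds & (1UL << n))) {
            files->open_fds |= 1UL << n;
            files->fd[n] = f;
            return n;
        }
    }
    return -1;
}

/* Mark descriptor n as free again. */
static void release_fd(struct files_sketch *files, int n)
{
    assert(n >= 0 && n < NR_OPEN_SKETCH);
    files->open_fds &= ~(1UL << n);
    files->close_on_exec &= ~(1UL << n);
    files->fd[n] = NULL;
}
```

Scanning for the lowest clear bit reproduces the familiar behavior that a newly opened file receives the smallest unused descriptor number.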


Task Structure

long utime, stime, cutime, cstime, start_time;
  time spent in user mode, time spent in system mode, total time the child processes spent in user mode and in system mode, and the time at which the process was created, respectively
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
  timers for the alarm system call (SIGALRM): the time in ticks until the timer is triggered, the value used for re-initialization, and the real-time interval timer, respectively


Task Structure

struct sem_undo *semundo;
  semaphores that need to be released when the process terminates
struct sem_queue *semsleeping;
  semaphore waiting queue
struct wait_queue *wait_chldexit;
  a process that calls wait4() halts on this queue until a child process terminates
struct rlimit rlim[RLIM_NLIMITS];
  limits on the use of resources (setrlimit(), getrlimit())


Task Structure

struct signal_struct *sig;
  signal handlers
struct signal_struct {
int count;
struct sigaction action[32];
};
int exit_code, exit_signal;
  return code and the signal that caused the program to abort
char comm[16];
  name of the program executed by the process


Task Structure

unsigned long personality;
  description of the characteristics of this version of UNIX (see also exec_domain)
int dumpable:1;
  whether a memory dump is to be written
int did_exec:1;
  whether the process is still running the old program (no execve, …)
struct desc_struct *ldt;
  used by WINE, the Windows emulator


Task Structure

struct linux_binfmt *binfmt;
  functions responsible for loading the program
struct thread_struct tss;
  holds all the data on the processor status at the time of the last transition from user mode to system mode; all registers are saved here
  struct thread_struct is defined in asm-i386/processor.h and, among other definitions, includes 8086-related information:
struct vm86_struct * vm86_info;
unsigned long screen_bitmap;
unsigned long v86flags, v86mask, v86mode;


Task Structure

unsigned long policy, rt_priority;
  scheduling policies: classic (SCHED_OTHER), real-time (SCHED_RR, SCHED_FIFO)
  rt_priority: real-time priority
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth;
#endif
  on a multi-processor machine, the kernel needs to know on which processor the task is running, etc.


Process Table

struct task_struct init_task;
  the start of the doubly linked task list
struct task_struct *task[NR_TASKS];
  task table
#define current (0+current_set[smp_processor_id()])
struct task_struct *current_set[NR_CPUS];
  current process (one entry per processor on a multi-processor architecture)
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
  macro for visiting all processes; the first task (init_task) is skipped
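The for_each_task macro relies on the task list being circular, with init_task as both start and end. The sketch below reproduces the macro over a reduced struct in user space; task_sketch, link_task, and count_tasks are hypothetical names, and the assumption that new tasks are linked in just before init_task is made only for this illustration.

```c
#include <assert.h>

/* Reduced stand-in for task_struct: only the list pointers and a pid. */
struct task_sketch {
    int pid;
    struct task_sketch *next_task, *prev_task;
};

/* An empty circular list: init_task points to itself in both directions. */
static struct task_sketch init_task = { 0, &init_task, &init_task };

#define for_each_task(p) \
    for (p = &init_task ; (p = p->next_task) != &init_task ; )

/* Link t in just before init_task, i.e. at the end of the list. */
static void link_task(struct task_sketch *t)
{
    t->next_task = &init_task;
    t->prev_task = init_task.prev_task;
    init_task.prev_task->next_task = t;
    init_task.prev_task = t;
}

/* Count every task except init_task, which the macro skips. */
static int count_tasks(void)
{
    struct task_sketch *p;
    int n = 0;

    for_each_task(p)
        n++;
    return n;
}
```

The loop advances p before testing it, which is why init_task itself is never visited.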


Files and inodes

Two important structures: file and inode (linux/fs.h)

The file structure (the process's view)
struct file {
mode_t f_mode;
  access mode when opened (RO, RW, WO)
loff_t f_pos;
  position of the read/write pointer (64-bit)
unsigned short f_flags;
  additional flags for controlling access rights (fcntl)


Files and inodes


Files and inodes

unsigned short f_count;
  reference count (dup, dup2, fork)
struct file *f_next, *f_prev;
  doubly linked list; global variable: struct file *first_file;
struct inode * f_inode;
  actual description of the file
struct file_operations * f_op;
  refers to a structure of function pointers for file operations, i.e., the functions are not called directly
  Since LINUX supports many file systems, the Virtual File System (VFS) is implemented.
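The indirection through f_op can be sketched with a single function pointer. Everything below is a hypothetical user-space illustration: the reduced structs, the toy read routine, and vfs_read_sketch are not the kernel's definitions, and the real file_operations structure has many more members.

```c
#include <assert.h>
#include <stddef.h>

struct file_sketch;

/* Reduced stand-in for struct file_operations: one operation only. */
struct file_operations_sketch {
    int (*read)(struct file_sketch *f, char *buf, int count);
};

/* Reduced stand-in for struct file. */
struct file_sketch {
    long f_pos;
    const struct file_operations_sketch *f_op;
};

/* A toy "file system" whose files contain an endless run of 'A' bytes. */
static int toy_read(struct file_sketch *f, char *buf, int count)
{
    int i;

    for (i = 0; i < count; i++)
        buf[i] = 'A';
    f->f_pos += count;
    return count;
}

static const struct file_operations_sketch toy_fops = { toy_read };

/* Generic layer: never calls toy_read directly, only through f_op. */
static int vfs_read_sketch(struct file_sketch *f, char *buf, int count)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;
    return f->f_op->read(f, buf, count);
}
```

The generic layer stays unchanged when another file system is added; only a new operations table has to be supplied.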


Files and inodes

struct inode {
kdev_t i_dev; /* which device the file is on */
unsigned long i_ino; /* position on the device */
umode_t i_mode;
nlink_t i_nlink;
uid_t i_uid; /* owner user id */
gid_t i_gid; /* owner group id */
off_t i_size; /* size in bytes */
time_t i_atime; /* time of last access */
time_t i_mtime; /* time of last modification */
time_t i_ctime; /* time of last modification to inode */


Memory Management

Macros
#define __get_free_page(priority) __get_free_pages((priority),0,0)
#define __get_dma_pages(priority, order) __get_free_pages((priority),(order),1)
extern unsigned long __get_free_pages(int priority, unsigned long gfporder, int dma);
  defined in linux/mm.h; the page size is 4 KB
  priority: GFP_BUFFER, GFP_ATOMIC, GFP_KERNEL, GFP_NOBUFFER, GFP_NFS (what to do if not enough pages are free)
  order: number of pages to be reserved, as a power of 2
  dma: the address must be addressable by the DMA component
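The meaning of the order argument can be worked out with a little arithmetic: order n stands for 2^n contiguous pages of 4 KB each. The helpers below are hypothetical and only illustrate that relation; PAGE_SIZE_SKETCH restates the 4 KB page size given above.

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH 4096UL   /* 4 KB pages, as stated above */

/* Number of pages in a block of the given order. */
static unsigned long pages_for_order(unsigned long order)
{
    return 1UL << order;
}

/* Size in bytes of a block of the given order. */
static unsigned long bytes_for_order(unsigned long order)
{
    return PAGE_SIZE_SKETCH << order;
}

/* Smallest order whose block is at least size bytes long. */
static unsigned long order_for_size(unsigned long size)
{
    unsigned long order = 0;

    while (bytes_for_order(order) < size)
        order++;
    return order;
}
```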


Memory Management Functions

extern inline unsigned long get_free_page(int priority)

{

unsigned long page;

page = __get_free_page(priority);

if (page)

memset((void *) page, 0, PAGE_SIZE);

return page;

}
  get_free_page() clears the page before returning it


Memory Management Functions

void *kmalloc(size_t size, int priority)
void kfree(void *__ptr)
  the kernel's counterparts of malloc() and free()


Waiting Queues

Structure for waiting queues (include/linux/wait.h)
struct wait_queue {
struct task_struct * task;
struct wait_queue * next;
};
  a task waits in the queue until a condition is met

Functions (sched.h)
extern inline void add_wait_queue(struct wait_queue ** p, struct wait_queue * wait)
extern inline void remove_wait_queue(struct wait_queue ** p, struct wait_queue * wait)


Waiting Queues

Functions (kernel/sched.c)
void sleep_on(struct wait_queue ** p);
void interruptible_sleep_on(struct wait_queue ** p);
void wake_up(struct wait_queue ** p);
void wake_up_interruptible(struct wait_queue ** p);
  sleep_on sets the process state to TASK_UNINTERRUPTIBLE; interruptible_sleep_on sets it to TASK_INTERRUPTIBLE
  wake_up sets the process state to TASK_RUNNING


Semaphores

Structure for semaphores (asm-i386/semaphore.h)
struct semaphore {
int count;
int waiting;
struct wait_queue * wait;
};

Functions
extern inline void down(struct semaphore * sem)
extern inline void up(struct semaphore * sem)


System Time and Timers

Time is measured in units of ticks (10 ms)
The global variable jiffies denotes the time in ticks since the system booted

Structure for timers (old)
struct timer_struct {
unsigned long expires;
void (*fn)(void);
};
extern struct timer_struct timer_table[32];
extern unsigned long timer_active; /* which entry is valid? */


System Time and Timers

Structure for timers (new)

struct timer_list {

struct timer_list *next;

struct timer_list *prev;

unsigned long expires;

unsigned long data; /* arguments */

void (*function)(unsigned long);

};

extern void add_timer(struct timer_list * timer);

extern int del_timer(struct timer_list * timer);
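The new timer interface can be illustrated with a user-space sketch of a doubly linked timer list. It assumes that the list is kept sorted by expiry time with a sentinel head, so that expired timers are always at the front; the names ending in _sketch, the sentinel timer_head_sketch, and the demo handler record_fired are hypothetical and are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Same fields as the slide's struct timer_list. */
struct timer_list_sketch {
    struct timer_list_sketch *next;
    struct timer_list_sketch *prev;
    unsigned long expires;           /* expiry time in ticks */
    unsigned long data;              /* argument passed to function */
    void (*function)(unsigned long);
};

/* Sentinel head of a circular list, assumed sorted by expires. */
static struct timer_list_sketch timer_head_sketch =
    { &timer_head_sketch, &timer_head_sketch, ~0UL, 0, NULL };

/* Insert timer in front of the first entry that expires later. */
static void add_timer_sketch(struct timer_list_sketch *timer)
{
    struct timer_list_sketch *p = timer_head_sketch.next;

    while (p != &timer_head_sketch && p->expires <= timer->expires)
        p = p->next;
    timer->next = p;
    timer->prev = p->prev;
    p->prev->next = timer;
    p->prev = timer;
}

/* Unlink timer; returns 1 if it was queued, 0 otherwise. */
static int del_timer_sketch(struct timer_list_sketch *timer)
{
    if (timer->next == NULL)
        return 0;
    timer->next->prev = timer->prev;
    timer->prev->next = timer->next;
    timer->next = timer->prev = NULL;
    return 1;
}

/* Run and remove every timer whose expiry time is not after now. */
static void run_timers_sketch(unsigned long now)
{
    struct timer_list_sketch *t;

    while ((t = timer_head_sketch.next) != &timer_head_sketch &&
           t->expires <= now) {
        del_timer_sketch(t);
        t->function(t->data);
    }
}

/* Demo handler: appends data as a decimal digit to fired_log. */
static unsigned long fired_log;
static void record_fired(unsigned long data)
{
    fired_log = fired_log * 10 + data;
}
```

Keeping the list sorted means the expiry check only ever has to look at the first entry.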


Process Management

Signal
Interrupt
Booting
Timer
Scheduler


Signal

Signals
  SIGHUP      1   hangup
  SIGINT      2   interrupt
  SIGQUIT     3   quit
  SIGILL      4   illegal instruction
  SIGTRAP     5   trace trap
  SIGABRT     6   abort (generated by abort(3) routine)
  SIGIOT      6   Input/Output Trap (obsolete)
  SIGBUS      7   bus error
  SIGFPE      8   arithmetic exception
  SIGKILL     9   kill (cannot be caught, blocked, or ignored)
  SIGUSR1     10  user-defined signal 1
  SIGSEGV     11  segmentation violation
  SIGUSR2     12  user-defined signal 2
  SIGPIPE     13  write on a pipe or other socket with no one to read it
  SIGALRM     14  alarm clock
  SIGTERM     15  software termination signal
  SIGSTKFLT   16
  SIGCHLD     17  child status has changed
  SIGCONT     18  continue after stop
  SIGSTOP     19  stop (cannot be caught, blocked, or ignored)
  SIGTSTP     20  stop signal generated from keyboard
  SIGTTIN     21  background read attempted from control terminal
  SIGTTOU     22  background write attempted to control terminal
  SIGURG      23  urgent condition present on socket
  SIGXCPU     24  cpu time limit exceeded (see getrlimit(2))
  SIGXFSZ     25  file size limit exceeded (see getrlimit(2))
  SIGVTALRM   26  virtual time alarm (see getitimer(2))
  SIGPROF     27  profiling timer alarm (see getitimer(2))
  SIGWINCH    28  window changed (see termio(4) and win(4S))
  SIGIO       29  I/O is possible on a descriptor (see fcntl(2V))
  SIGPOLL     29  same as SIGIO
  SIGPWR      30  power failure (for UPS)
  SIGUNUSED   31


Signal System Calls

Important system calls

kill(int pid, int sig)
  sends the signal sig to a process or a group of processes
  If pid is greater than zero, the signal is sent to the process with the PID pid.
  If pid is zero, the signal is sent to the process group of the current process.
  If pid is -1, the signal is sent to all processes, except the system processes and the current process.
  If pid is less than -1, the signal is sent to all processes of the process group -pid.
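The four cases for the pid argument can be summarized as a small decision function. This is only a sketch of the rules listed above; the enum kill_target and the helpers classify_kill_target and target_group_id are hypothetical names and do not appear in the kernel source.

```c
#include <assert.h>

/* Who receives the signal, depending on the pid argument of kill(). */
enum kill_target {
    TARGET_PROCESS,     /* pid > 0: the process with that PID */
    TARGET_OWN_GROUP,   /* pid == 0: process group of the caller */
    TARGET_ALL,         /* pid == -1: all processes, with the stated exceptions */
    TARGET_GROUP        /* pid < -1: the process group -pid */
};

static enum kill_target classify_kill_target(int pid)
{
    if (pid > 0)
        return TARGET_PROCESS;
    if (pid == 0)
        return TARGET_OWN_GROUP;
    if (pid == -1)
        return TARGET_ALL;
    return TARGET_GROUP;
}

/* For TARGET_GROUP, the addressed process group ID is -pid. */
static int target_group_id(int pid)
{
    assert(pid < -1);
    return -pid;
}
```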



kill(int pid, int sig)
  The real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user.
  The single exception is the signal SIGCONT, which only requires that the sending and receiving processes belong to the same session.
  Errors:
  – EINVAL: invalid sig
  – ESRCH: process or process group does not exist
  – EPERM: no privileges



kill(int pid, int sig)
  Implementation
  – linux/kernel/exit.c
  – sys_kill() -> send_sig(), kill_pg(), kill_proc() -> generate()
  – see also force_sig(), kill_sl()
  – also called from ret_from_sys_call() -> do_signal() -> send_sig() -> handle_signal() (signal.c, 223) -> setup_frame() (160) -> regs->eip = sa->sa_handler (213)


sys_kill
  linux/kernel/exit.c, lines 318-339
  322-323: if pid is zero, the signal is sent to the process group of the current process
  324-334: if pid is -1, the signal is sent to all processes, except the system processes (PID = 0 or 1) and the current process. The for_each_task macro is defined in include/linux/sched.h, line 491. If count is zero, the error code ESRCH is returned.
  335-336: if pid is less than -1, the signal is sent to all processes of the process group -pid
  338: if pid is greater than zero, the signal is sent to the process with the PID pid


kill_pg
  linux/kernel/exit.c, lines 258-275
  264-265: sig must be in [1..32]; pgrp (the process group ID) must be greater than zero
  266-273: for each process whose process group ID is pgrp, send the signal sig to it (send_sig). On success, send_sig returns zero.
  274: if found = 0, no process has been found and the error ESRCH is returned; otherwise zero is returned


kill_proc
  linux/kernel/exit.c, lines 301-312
  305-306: sig must be in [1..32]
  307-310: if a process with the given pid is found, send the signal sig to it (send_sig)
  311: if no process has been found, return the error ESRCH


send_sig
  linux/kernel/exit.c, lines 73-101
  75-76: p cannot be null and sig must be less than or equal to 32
  77: priv is the privilege (0 for a normal process, 1 for the super-user); SIGCONT can only be sent to a process that belongs to the same session
  78-79: the real or effective user ID of the sending process must match the real or saved set-user-ID of the receiving process, unless the effective user ID of the sending process is the super-user
  80: super-user?
  81: if none of the above conditions is true, return an error


send_sig
  82-83: if sig = 0, do nothing
  84-88: if sig in the task structure is null (the process is in the zombie state), do nothing
  89-95: if sig is SIGKILL or SIGCONT and the process is in state TASK_STOPPED, wake up the process and reset the SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU signals
  96-97: if sig is SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU, reset SIGCONT
  99: actually generate the signal


generate
  linux/kernel/exit.c, lines 29-51
  31: set up the signal mask
  32: action of the signal, sa = p->sig->action[sig-1]
  39: if the signal is not blocked and the process is not traced
  41: and if the handler of the signal is SIG_IGN (to be ignored) and the signal is not from a state change of a child process
  42: then return immediately
  44-46: if the handler is SIG_DFL (default action) and the signal is SIGCONT, SIGCHLD, SIGWINCH, or SIGURG, then return immediately (the wake-up has already been done for SIGCONT)


generate
  48: finally, set the signal
  49-50: if the receiving process is interruptible and the signal is not blocked, wake up the process


force_sig
  linux/kernel/exit.c, lines 57-70
  forces a signal to be sent to a process (it cannot be ignored)
  60: if the process is not in the zombie state
  61-62: set the signal and get the signal action struct
  63: really set the signal
  64: the signal cannot be blocked, so clear the bit in p->blocked
  65-66: if the handler is SIG_IGN, reset it to SIG_DFL
  67-68: wake up the process if it is interruptible


kill_sl
  linux/kernel/exit.c, lines 282-299
  sends a signal to the session leader
  288-289: sig must be in [1..32]; the session must be greater than zero
  290-297: for each process, check whether its session ID is equal to sess and the process is the session leader; if so, send the signal to the session leader (send_sig)
  298: return an error if no process is found



sigaction(int sig, struct sigaction *act, struct sigaction *oact)
  examines and changes a signal action (replaces signal())
  act: new action; oact: old action (returned)
struct sigaction {
__sighandler_t sa_handler; /* SIG_DFL, SIG_IGN, or … */
sigset_t sa_mask; /* signals to be blocked during execution of handler */
unsigned long sa_flags; /* SA_ONSTACK: on sig stack
                           SA_INTERRUPT: do not restart system call on signal return
                           SA_RESETHAND: reset handler to SIG_DFL when signal taken
                           SA_NOCLDSTOP: don't send SIGCHLD on child stop */
void (*sa_restorer)(void);
};
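From user space, sigaction() is typically used as in the sketch below, which installs a handler for SIGUSR1 through the POSIX interface. The handler on_usr1, the flag got_signal, and the wrapper install_usr1_handler are hypothetical names chosen for this example; the sketch assumes a POSIX system and a single-threaded program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t got_signal = 0;

static void on_usr1(int sig)
{
    got_signal = sig;
}

/* Install on_usr1 for SIGUSR1; the previous action is returned in *old. */
static int install_usr1_handler(struct sigaction *old)
{
    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_usr1;
    sigemptyset(&act.sa_mask);   /* block no additional signals in the handler */
    act.sa_flags = 0;
    return sigaction(SIGUSR1, &act, old);
}
```

Passing the saved old action back to sigaction() later restores the previous behavior.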



sigprocmask(int how, sigset_t *set, sigset_t *oset)
  examines and changes the calling process's blocked signals
  how
  – SIG_BLOCK: add the signals in set to the blocked signals
  – SIG_UNBLOCK: remove the signals in set from the blocked signals
  – SIG_SETMASK: replace the blocked signals with set
  the previous mask is returned in oset
  SIGKILL and SIGSTOP cannot be blocked
  the behavior is undefined for SIGFPE, SIGILL, SIGSEGV if they are blocked when they are generated



sigpending(sigset_t *set)
  stores in set the signals that are blocked from delivery and pending for the calling process

ssetmask(int mask), sgetmask()
  set/get the blocked signals of the current process; made obsolete by sigprocmask()

sigsuspend(int restart, unsigned long oldmask, unsigned long newmask)
  replaces the process's signal mask with newmask and then suspends the process until a signal is delivered
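How blocking and pending signals interact can be observed from user space with the POSIX calls sigprocmask() and sigpending(). The sketch below blocks SIGUSR2, raises it, checks that it is pending but not yet delivered, and then restores the old mask. The function demo_block_and_pending and the counter deliveries are hypothetical names; the sketch assumes a single-threaded program in which SIGUSR2 is not already blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t deliveries = 0;

static void count_delivery(int sig)
{
    (void)sig;
    deliveries++;
}

/* Returns 1 if SIGUSR2 was pending while blocked and was delivered only
 * after unblocking, 0 if not, and -1 if a system call failed. */
static int demo_block_and_pending(void)
{
    struct sigaction act;
    sigset_t block, old, pending;
    int was_pending, held_back;

    memset(&act, 0, sizeof(act));
    act.sa_handler = count_delivery;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGUSR2, &act, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)   /* add to blocked set */
        return -1;

    deliveries = 0;
    raise(SIGUSR2);                                  /* stays pending */
    if (sigpending(&pending) != 0)
        return -1;
    was_pending = sigismember(&pending, SIGUSR2);
    held_back = (deliveries == 0);

    /* Restoring the old mask unblocks SIGUSR2, so it is delivered now. */
    if (sigprocmask(SIG_SETMASK, &old, NULL) != 0)
        return -1;
    return was_pending == 1 && held_back && deliveries == 1;
}
```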


sys_sigaction
  linux/kernel/signal.c, lines 150-182
  155-156: check the signal number [1..32]
  157: get the old sigaction (p)
  158-170: if action (the new setting) is not null, check whether it can be read. If yes, copy the content of action to new_sa
  171-176: if oldaction is not null, store the old sigaction (p) in oldaction
  177-180: replace the sigaction with new_sa


sys_sigprocmask
  linux/kernel/signal.c, lines 29-60
  34-52: if set (the new mask) is not null, it is processed according to how: SIG_BLOCK adds the signals in set to the blocked signals, SIG_UNBLOCK removes the signals in set from the blocked signals, SIG_SETMASK replaces the blocked signals with set
  53-58: if oset is not null, copy old_set (current->blocked) to oset


sys_sigpending
  linux/kernel/signal.c, lines 80-88
  stores the signals that are pending but blocked into set
  84: check whether set can be written
  85-86: if yes, copy the pending signals that are currently blocked to set


Interrupt

Allows the hardware to communicate with the operating system

Source files
  arch/i386/kernel/irq.c
  include/asm-i386/irq.h

Interrupt handlers
  slow, fast, bad (irq.c, lines 142-172)
  the interrupt handlers are built first: lines 114-136 of irq.c, lines 200-243 of irq.h (macros)


Interrupt

Interrupt numbers
  First set: 0-7
  Second set: 8-15
  0 for the timer
On an SMP board (486 and above)
  irq13 for interprocessor interrupts
  irq16 for SMP reschedule
On a 386
  irq13 for SIGFPE (unreliable)
  no irq16


Interrupt

Slow interrupts (include/asm/irq.h, lines 205-222)
  206: build the symbol table entry IRQ#_interrupt (see irq.c, 142)
  208: SAVE_ALL, save all registers
  209: ENTER_KERNEL, synchronizes the processors' access to the kernel on an SMP board
  210: ACK_FIRST (or SECOND), acknowledge the interrupt to the interrupt controller
  211: increase intr_count (number of nested interrupts)
  213-217: call do_IRQ(int irq, struct pt_regs *regs) (see arch/i386/kernel/irq.c, lines 343-364)
  219: UNBLK_FIRST (or SECOND), inform the interrupt controller that interrupts of this type can be accepted again


do_IRQ()

struct irqaction * action = *(irq + irq_action);
…
while (action) {
do_random |= action->flags;
action->handler(irq, action->dev_id, regs);
action = action->next;
}


Data Structure

struct irqaction {

void (*handler)(int, void *, struct pt_regs *);

unsigned long flags;

unsigned long mask;

const char *name;

void *dev_id;

struct irqaction *next;

};
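The handler loop of do_IRQ() and the next pointer of struct irqaction together allow several devices to share one interrupt line. The sketch below imitates this in user space; the reduced structs, the table size NR_IRQS_SKETCH, and the functions add_action, do_IRQ_sketch, and count_handler are hypothetical and leave out the flags handling of the real code.

```c
#include <assert.h>
#include <stddef.h>

struct pt_regs_sketch { int unused; };   /* stand-in for struct pt_regs */

/* Reduced stand-in for struct irqaction. */
struct irqaction_sketch {
    void (*handler)(int, void *, struct pt_regs_sketch *);
    unsigned long flags;
    void *dev_id;
    struct irqaction_sketch *next;
};

#define NR_IRQS_SKETCH 16
static struct irqaction_sketch *irq_action[NR_IRQS_SKETCH];

/* Append an action to the list of the given irq (shared interrupts). */
static void add_action(int irq, struct irqaction_sketch *a)
{
    struct irqaction_sketch **p = irq_action + irq;

    while (*p != NULL)
        p = &(*p)->next;
    a->next = NULL;
    *p = a;
}

/* Call every handler registered for irq; returns how many were called. */
static int do_IRQ_sketch(int irq, struct pt_regs_sketch *regs)
{
    struct irqaction_sketch *action = *(irq + irq_action);
    int called = 0;

    while (action) {
        action->handler(irq, action->dev_id, regs);
        action = action->next;
        called++;
    }
    return called;
}

/* Demo handler: dev_id points to an int that counts interrupts. */
static void count_handler(int irq, void *dev_id, struct pt_regs_sketch *regs)
{
    (void)irq;
    (void)regs;
    ++*(int *)dev_id;
}
```

The dev_id argument lets one handler function serve several devices, since each call receives the identity of the device it was registered for.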


Interrupt

Slow interrupts (irq.h, lines 205-222)
  220: decrease intr_count
  221: increase syscall_count
  222: jump to the routine ret_from_sys_call (never returns)
Fast interrupts (irq.h, lines 224-236)
  use SAVE_MOST and RESTORE_MOST instead of SAVE_ALL and do not call ret_from_sys_call
  229-230: call do_fast_IRQ(int irq) (see irq.c, lines 371-393)
Bad interrupts (irq.h, lines 237-243)
  simply acknowledge the interrupt (no handler is installed)


Interrupt Flow

IDT[] -> interrupt[] (or fast_interrupt[], bad_interrupt[])
IRQi_interrupt (or fast or bad) -> do_IRQ() -> irqaction[]
irqaction[i]->handler -> jump to ret_from_sys_call
jump to handle_bottom_half (if bh_mask & bh_active)
do_bottom_half -> bh_base[] -> bh_base[i]


Interrupt: What the Related Code Does

request_irq() -> setup_x86_irq() (or called from an init function)
setup_x86_irq does two things:
  first, it sets the IDT[] entry to point to interrupt[]
  second, it puts the corresponding action into irqaction[]. If the irq is shared, irqaction[] holds a list.
interrupt[], fast_interrupt[], and bad_interrupt[] are built by the BUILDIRQ macro. All of them are assembly code. In the assembly code, interrupt and fast_interrupt call do_IRQ, whereas bad_interrupt does not. In addition, after calling do_IRQ, interrupt jumps to ret_from_sys_call (fast_interrupt does not).


Interrupt: What the Related Code Does

do_IRQ executes, one by one, the handlers of the actions recorded in the corresponding irqaction[] entry.
On the jump to ret_from_sys_call, a check is made whether a jump to handle_bottom_half is needed (bh_mask & bh_active). The assembly code of handle_bottom_half calls do_bottom_half, and do_bottom_half calls the functions stored in bh_base[].


Interrupt: What the Related Code Does

bh_base[] is set up with init_bh(), just as an irq is set up with request_irq().


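Data structures related to bottom halves

bh_mask: a bit set to 1 means that a bottom half routine has been installed
bh_active: a bit set to 1 means that an interrupt has occurred, its fast part has been handled, and the bottom half part is waiting to be executed
bh_mask_count: counts how many times this bottom half has been disabled; 0 means that there is no nested disable (i.e., it is enabled)
bh_base: the bottom half routines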
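The cooperation of bh_mask, bh_active, and bh_base can be sketched as follows. This is a user-space illustration under stated assumptions: the functions carry a _sketch suffix because they only imitate init_bh(), mark_bh(), and do_bottom_half(), the table size NR_BH_SKETCH is made up, and bh_mask_count (nested disabling) is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BH_SKETCH 32

static unsigned long bh_active;               /* bit n: bottom half n is waiting */
static unsigned long bh_mask;                 /* bit n: bottom half n is installed */
static void (*bh_base[NR_BH_SKETCH])(void);   /* the bottom half routines */

/* Install a bottom half routine in slot nr. */
static void init_bh_sketch(int nr, void (*routine)(void))
{
    bh_base[nr] = routine;
    bh_mask |= 1UL << nr;
}

/* Called from the fast part of an interrupt: request the bottom half. */
static void mark_bh_sketch(int nr)
{
    bh_active |= 1UL << nr;
}

/* Run every bottom half that is both installed and active. */
static int do_bottom_half_sketch(void)
{
    unsigned long active = bh_active & bh_mask;
    int nr, ran = 0;

    for (nr = 0; nr < NR_BH_SKETCH; nr++) {
        unsigned long bit = 1UL << nr;

        if (active & bit) {
            bh_active &= ~bit;   /* clear before running the routine */
            bh_base[nr]();
            ran++;
        }
    }
    return ran;
}

/* Demo bottom half that counts how often it has run. */
static int timer_bh_runs;
static void timer_bh_sketch(void)
{
    timer_bh_runs++;
}
```

A bottom half that is marked active but not installed is simply left pending, because the bh_mask & bh_active test filters it out.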


Example: interrupt part

start_kernel() -> time_init() -> setup_x86_irq(0, &irq0) -> set_intr_gate()
irq0->action = timer_interrupt

So IDT[0] -> interrupt[0] -> do_IRQ -> timer_interrupt() -> do_timer()


Example: bottom half

start_kernel() -> sched_init() -> init_bh(TIMER_BH, timer_bh)

So after do_timer has been called, control jumps to ret_from_sys_call -> handle_bottom_half -> do_bottom_half -> bh_base[0] -> timer_bh


Device Driver Example

The 3Com 3C509 network card is used as the example
  The source is in drivers/net/3c509.c
  When the network card is opened (el3_open(), line 347)
    request_irq(dev->irq, &el3_interrupt, …) 356
  When an interrupt occurs
    el3_interrupt(…) 515
    mark_bh(NET_BH) 548
  Where is NET_BH initialized?
    net_dev_init() (net/core/dev.c)
    init_bh(NET_BH, net_bh); 1471


init_IRQ()

arch/i386/kernel/irq.c
  536: after the system starts, the function void init_IRQ(void) initializes the IRQs
  545-547: outb_p and outb both output one byte to a port
  548-549: this for loop uses set_intr_gate to install the bad_interrupt array (for set_intr_gate see system.h, lines 235-247). Initially every entry points to bad_interrupt[], which means that no interrupt handler has been installed yet. In request_irq(), the entry is changed to point to interrupt[] or fast_interrupt[], depending on the requested flag.
  555-556: request_region() is a function in apricot.c and a macro in resource.c; it is not discussed here
  557-558: setup_x86_irq() is used to set up the Interrupt Descriptor Table (IDT)


setup_x86_irq()

  395: setup_x86_irq() begins
  401: p = irq_action + irq; irq_action is defined at line 219 as the head of an array of 16 struct pointers, all NULL; adding irq selects the entry for one of the IRQs 0-15
  402-417: this code decides whether the IRQ can be shared. Fast and bad interrupts can never be shared; only slow interrupts can be shared. Interrupt sharing is discussed in detail in Chapter 7 and is not covered here.
  426-432: if the IRQ cannot be shared, it is set up in fast_interrupt[] if it is a fast interrupt, otherwise in interrupt[]


Request and Free IRQ

int request_irq()
  437-467: when a device asks the system for an IRQ, request_irq() is called; according to the IRQ number requested by the device, this function assigns the handler of the IRQ to the device

void free_irq()
  469-495: free_irq() is the exact opposite of request_irq(); when a device is removed, free_irq() is called to release the IRQ


Boot

Boot process
  The BIOS reads the first sector of the boot disk (floppy, hard disk, …, according to the BIOS parameter setting)
  It loads the boot sector (512 bytes), which contains program code for loading the operating system kernel (e.g., the Linux Loader, LILO), to 0x7C00 (arch/i386/boot/bootsect.s, 35) in real mode
  The boot sector ends with 0xAA55
Boot disk
  Floppy: the first sector
  Hard disk: the first sector is the master boot record (MBR)


Boot Sector and MBR

Boot sector (floppy)
  Offset  Content
  0x000   JMP 0x03E
  0x003   Disk parameters
  0x03E   Program code loading the OS kernel
  0x1FE   0xAA55

MBR and extended partition table
  Offset  Length  Content
  0x000   0x1BE   Code for loading the boot sector of the active partition
  0x1BE   0x010   Partition 1
  0x1CE   0x010   Partition 2
  0x1DE   0x010   Partition 3
  0x1EE   0x010   Partition 4
  0x1FE   0x002   0xAA55


MBR

MBR
  Four primary partitions only: there are 4 partition entries
  Each entry is 16 bytes
Extended partition
  Used if more than 4 partitions are needed
  The first sector of an extended partition has the same layout as the MBR
  The first partition entry is for the first logical drive
  The second partition entry points to the next logical drive (MBR)
The first sector of each primary or extended partition contains a boot sector


Extended Partition

[Figure: the MBR and the MBR of an extended partition. Layout of the MBR for an extended partition:]
  Code for loading the boot sector of the active partition
  Logical partition
  Next extended partition
  Not used
  Not used
  0xAA55


Structure of a Partition Entry

  Field     Bytes  Description
  Boot      1      Boot flag: 0 = not active, 0x80 = active
  HD        1      Begin: head number
  SEC CYL   2      Begin: sector and cylinder number of the boot sector
  SYS       1      System code: 0x83 Linux, 0x82 swap, 0x05 extended
  HD        1      End: head number
  SEC CYL   2      End: sector and cylinder number of the last sector
  (low byte ... high byte)  4  Relative sector number of the start sector
  (low byte ... high byte)  4  Number of sectors in the partition
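The 16-byte entry layout and the MBR offsets given earlier (first entry at 0x1BE, signature 0xAA55 at 0x1FE) can be turned into a small parser. The sketch below is a user-space illustration; struct partition_sketch and the functions le32 and parse_partition are hypothetical names, and the multi-byte fields are assumed to be stored low byte first, as the layout indicates.

```c
#include <assert.h>
#include <stdint.h>

/* The fields of a partition entry that this sketch extracts. */
struct partition_sketch {
    uint8_t  boot_flag;      /* 0 = not active, 0x80 = active */
    uint8_t  sys_code;       /* e.g. 0x83 Linux, 0x82 swap, 0x05 extended */
    uint32_t start_sector;   /* relative sector number of the start sector */
    uint32_t nr_sectors;     /* number of sectors in the partition */
};

/* Read a 32-bit value stored low byte first. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse entry index (0..3) of a 512-byte MBR; returns 0 on success and
 * -1 if the index is out of range or the 0xAA55 signature is missing. */
static int parse_partition(const uint8_t *sector, int index,
                           struct partition_sketch *out)
{
    const uint8_t *e;

    if (index < 0 || index > 3)
        return -1;
    if (sector[0x1FE] != 0x55 || sector[0x1FF] != 0xAA)
        return -1;                   /* the word 0xAA55 is stored as 55 AA */
    e = sector + 0x1BE + 16 * index;
    out->boot_flag = e[0];
    out->sys_code = e[4];
    out->start_sector = le32(e + 8);
    out->nr_sectors = le32(e + 12);
    return 0;
}
```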


Active Partition

Booting is carried out from the active partition, which is determined by the boot flag
Operations of the MBR
  determine the active partition
  load the boot sector of the active partition
  jump into the boot sector at offset 0


Boot Process

Compressed kernel size
  include/linux/config.h: DEF_SYSSIZE = 0x7F00 clicks = 508 KB (1 click = 16 bytes)
  zImage is smaller than this size
The boot code of zImage comes from arch/i386/boot/bootsect.s; it is first loaded at 0x7C00, then moved to 0x90000, where execution continues after a jump
setup.s is then loaded at 0x90200 and the kernel image is loaded at 0x10000 (64 KB)
setup.s moves the kernel from 0x10000 to 0x1000 (4 KB) to save memory, then enters protected mode and jumps to 0x1000 (lines 520-536)


bootsect.s

Lines 59-69
  Moves the code from 0x7C00 (BOOTSEG) to 0x90000 (INITSEG)
  64-65: set si and di to zero
  66: cld clears the DF flag in EFLAGS to 0, which makes the move go upward (the address increases as data is moved)
  rep: repeats line 68
  68: move word by word until cx = 0 (cx is initialized to 256)


Boot Process

Uncompressed kernel
  The start point is in arch/i386/kernel/head.s
  It initializes the system and then calls start_kernel
  So the system then runs from start_kernel()


Booting the System

LILO loads the Linux kernel into memory
  execution starts from "start:" in arch/i386/boot/setup.s
  setup.s is responsible for initializing the hardware, asking the BIOS for memory/disk/other parameters, and putting them in memory at 0x90000-0x901FF
  520-521: switch to protected mode
  534-536: jmp 0x1000, KERNEL_CS
    jmpi 0x100000, KERNEL_CS for big kernels
Execution continues from startup_32 in arch/i386/kernel/head.s


Booting the System

More sections of the hardware are initialized (paging table, co-processor, interrupt descriptor table (IDT), stack, environment, …)
  219: calls start_kernel() in init/main.c
start_kernel(): all areas of the kernel are initialized and process 1 is created
  794-852: more initializations
  858: creates process 1 (kernel_thread(init, NULL, 0))
  – process 0 is an idle process; it does nothing and runs when no other process needs the CPU
  – process 1 calls init() and starts some daemons
  868: process 0 enters an infinite idle loop


Booting the System

init() in init/main.c, lines 919-1020
  927: bdflush is responsible for synchronizing the buffer cache contents with the file system
  929: kswapd is the background pageout daemon (swapping)
  937: setup initializes the file systems and mounts the root file system
  986-991: connects to the console and opens file descriptors 0, 1, 2 (console)
  993-997: tries to execute one of the programs /etc/init, /bin/init, /sbin/init
  999-1003: if none of the three programs exists, executes /etc/rc


Booting the System

init() in init/main.c, lines 919-1020
  1005-1018: enters an infinite loop in which a shell is started so that users can log in on the console


Setitimer System Call

int setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
which:
  ITIMER_REAL: decrements in real time. A SIGALRM signal is delivered when this timer expires.
  ITIMER_VIRTUAL: decrements in process virtual time. It runs only when the process is executing (not including system time). A SIGVTALRM signal is delivered when this timer expires.
  ITIMER_PROF: decrements both in process virtual time and when the system is running on behalf of the process. A SIGPROF signal is delivered when this timer expires. It is designed for profiling the execution of interpreted programs.

The itimerval struct has two fields: it_interval and it_value. If it_value is non-zero, it indicates the time to the next timer expiration. If it_interval is non-zero, it specifies a value to be used in reloading it_value when the timer expires. Setting it_value to zero disables a timer. Setting it_interval to zero causes a timer to be disabled after its next expiration.
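A one-shot ITIMER_REAL timer can be tried from user space as sketched below: it_value is set to the delay and it_interval to zero, so the timer is disabled after its first expiration. The function wait_one_shot_ms, the handler on_alarm, and the counter alarms are hypothetical names; the sketch assumes a POSIX system and a single-threaded program, and it blocks SIGALRM until sigsuspend() so that the signal cannot be missed.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarms = 0;

static void on_alarm(int sig)
{
    (void)sig;
    alarms++;
}

/* Arm a one-shot ITIMER_REAL timer of ms milliseconds and wait for its
 * SIGALRM. Returns 0 on success, -1 on error or if ms is not positive. */
static int wait_one_shot_ms(long ms)
{
    struct sigaction act;
    struct itimerval tv;
    sigset_t block, old, wait_mask;

    if (ms <= 0)
        return -1;                    /* it_value of zero would disable the timer */

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_alarm;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGALRM, &act, NULL) != 0)
        return -1;

    /* Block SIGALRM so that it cannot arrive before sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);

    tv.it_value.tv_sec = ms / 1000;
    tv.it_value.tv_usec = (ms % 1000) * 1000;
    tv.it_interval.tv_sec = 0;        /* zero interval: no reload, one shot */
    tv.it_interval.tv_usec = 0;
    alarms = 0;
    if (setitimer(ITIMER_REAL, &tv, NULL) != 0)
        return -1;

    while (alarms == 0)
        sigsuspend(&wait_mask);       /* atomically unblock and wait */

    return sigprocmask(SIG_SETMASK, &old, NULL);
}
```

A periodic timer would be obtained by setting it_interval to a non-zero value, in which case it_value is reloaded from it on every expiration.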


Related Codes

ITIMER_REAL
  Data structure: timer_head
  run_timer_list()
  it_real_fn() (itimer.c, 98; sched.h, 297)
ITIMER_VIRTUAL
  do_it_virt() (sched.c, 943)
ITIMER_PROF
  do_it_prof() (sched.c, 956)
sys_setitimer() -> _setitimer() -> add_timer()
  itimer.c, 115; sched.c, 606


Timer Interrupt Important global variables

jiffies kernel/sched.c (96): unsigned long volatile jiffies=0; ticks (10ms) since the system was started up

xtime kernel/sched.c (47): volatile struct timeval xtime; actual time

The timer interrupt updates jiffies and marks the bottom half active. The bottom half is called later, after other interrupts have been handled.


Timer Interrupt


Timer Interrupt do_timer (kernel/sched.c, 1077-1095)

1079: increase jiffies

1080: increase lost_ticks (ticks since the last call of the bottom half routine)

1081: mark the bottom half active (include/linux/interrupt.h)

1082-1083: increase lost_ticks_system if in kernel mode (ticks spent in kernel mode since the last call of the bottom half routine)

1084-1092: profile

1093-1094: mark timer queue handler active
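The steps of do_timer can be modeled in plain C. This is a user-space sketch, not the kernel source: struct tick_state, model_do_timer() and the MODEL_* bit numbers are hypothetical names, and the profiling step is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit numbers standing in for the kernel's bottom half indices. */
enum { MODEL_TIMER_BH = 0, MODEL_TQUEUE_BH = 1 };

struct tick_state {
    unsigned long jiffies;            /* ticks since start-up */
    unsigned long lost_ticks;         /* ticks not yet seen by the bottom half */
    unsigned long lost_ticks_system;  /* of those, ticks spent in kernel mode */
    unsigned long bh_active;          /* bit mask of bottom halves marked active */
    int tq_timer_pending;             /* non-zero if the timer task queue has work */
};

/* One timer interrupt: only cheap bookkeeping, the real work is deferred. */
void model_do_timer(struct tick_state *s, int in_user_mode)
{
    assert(s != NULL);
    s->jiffies++;
    s->lost_ticks++;
    s->bh_active |= 1UL << MODEL_TIMER_BH;        /* mark the timer bottom half */
    if (!in_user_mode)
        s->lost_ticks_system++;
    if (s->tq_timer_pending)
        s->bh_active |= 1UL << MODEL_TQUEUE_BH;   /* mark the timer queue handler */
}
```

The point of the split is that the interrupt handler only counts and sets flags; everything that takes longer waits for the bottom half.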


Timer Interrupt Bottom half routines of the timer interrupt

timer_bh (kernel/sched.c, lines 1070-1075)

1072: updating the times, kernel/sched.c, lines 1054-1068

– 1058: xchg gets the value of lost_ticks and resets it to zero atomically.

– 1063: get lost_ticks_system and reset it

– 1064: calculate system load (lines 725-738)

– 1065: update the real time xtime (740-922, hw)

– 1066: update times of current process (977-1049)

1073, 1074: updating system wide timers (649-683)
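The read-and-reset of lost_ticks has to be a single atomic step, otherwise a tick arriving between the read and the reset would be lost. A minimal sketch of the same idea using C11 atomics instead of the i386 xchg instruction; all model_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stands in for the kernel's lost_ticks counter. */
static atomic_ulong model_lost_ticks;

/* Producer side: what the timer interrupt does on every tick. */
void model_tick(void)
{
    atomic_fetch_add(&model_lost_ticks, 1UL);
}

/* Consumer side: fetch the pending ticks and reset the counter in one step. */
unsigned long model_take_lost_ticks(void)
{
    return atomic_exchange(&model_lost_ticks, 0UL);
}

/* Simulate `ticks` interrupts followed by one bottom half run. */
unsigned long model_bottom_half_after(int ticks)
{
    assert(ticks >= 0);
    for (int i = 0; i < ticks; i++)
        model_tick();
    return model_take_lost_ticks();
}
```

After model_bottom_half_after(3) returns 3, a second call to model_take_lost_ticks() returns 0, because the exchange already cleared the counter.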


Timer Interrupt update_process_times (977-1049)

981: user time = ticks - system time

983: decrease the remaining time quota of the current process

984-987: if the time quota is used up, a reschedule is needed

988-992: kernel statistics

994: update current process’s times (924-975)

929-930: update process’s user and system times

932-940: check whether the process has used up its CPU limit (setrlimit sets limits on resource usage). If it exceeds the soft limit, SIGXCPU is sent. If it exceeds the hard limit, SIGKILL is sent to kill the process.
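The CPU limit check can be sketched as a pure function. This is a simplified model under stated assumptions: model_cpu_limit_signal() is a hypothetical name, times are given in ticks, limits in seconds, and the function reports only the strongest signal that would be sent.

```c
#include <assert.h>
#include <signal.h>

/*
 * Returns SIGKILL if the accumulated CPU time exceeds the hard limit,
 * SIGXCPU if it exceeds only the soft limit, and 0 otherwise.
 * utime and stime are in ticks, hz is the number of ticks per second.
 */
int model_cpu_limit_signal(unsigned long utime, unsigned long stime,
                           unsigned long hz,
                           unsigned long soft_secs, unsigned long hard_secs)
{
    unsigned long psecs;

    assert(hz > 0 && soft_secs <= hard_secs);
    psecs = (utime + stime) / hz;   /* whole seconds of CPU time used */
    if (psecs > hard_secs)
        return SIGKILL;
    if (psecs > soft_secs)
        return SIGXCPU;
    return 0;
}
```

With hz = 100, a process with 150 user ticks and 60 system ticks has used 2 whole seconds, so a soft limit of 1 second yields SIGXCPU while a hard limit of 5 seconds is not yet reached.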


Timer Interrupt update_process_times (977-1049)

994: update current process’s times (924-975)

947-953: update interval timers. When a timer has expired, send SIGVTALRM.

960-966: update profile

run_timer_list (649-665)

654: check timer list to see which timer has expired

655-662: prepare to call timer handler

run_old_timers (667-683) check timer table (obsolete)
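The idea behind run_timer_list is a list sorted by expiry time, so only the head has to be inspected. A minimal sketch, assuming a singly linked list and ignoring jiffies wrap-around; struct model_timer and the model_* functions are hypothetical and simpler than the kernel's timer list.

```c
#include <assert.h>
#include <stddef.h>

struct model_timer {
    struct model_timer *next;
    unsigned long expires;              /* absolute expiry time in ticks */
    void (*function)(unsigned long);    /* handler to call on expiry */
    unsigned long data;                 /* argument passed to the handler */
};

/* Insert t so that the list stays sorted by ascending expiry time. */
void model_add_timer(struct model_timer **head, struct model_timer *t)
{
    assert(head != NULL && t != NULL && t->function != NULL);
    while (*head != NULL && (*head)->expires <= t->expires)
        head = &(*head)->next;
    t->next = *head;
    *head = t;
}

/* Unlink and run every timer that has expired by `now`; returns how many ran. */
int model_run_timer_list(struct model_timer **head, unsigned long now)
{
    int fired = 0;

    while (*head != NULL && (*head)->expires <= now) {
        struct model_timer *t = *head;
        *head = t->next;                /* unlink before calling the handler */
        t->next = NULL;
        t->function(t->data);
        fired++;
    }
    return fired;
}

static unsigned long model_fired_sum;
static void model_add_to_sum(unsigned long data) { model_fired_sum += data; }

/* Demo: three timers expiring at ticks 30, 10 and 20; returns the sum of fired data. */
unsigned long model_timer_demo(unsigned long now)
{
    struct model_timer a = { NULL, 30, model_add_to_sum, 1 };
    struct model_timer b = { NULL, 10, model_add_to_sum, 2 };
    struct model_timer c = { NULL, 20, model_add_to_sum, 4 };
    struct model_timer *head = NULL;

    model_fired_sum = 0;
    model_add_timer(&head, &a);
    model_add_timer(&head, &b);
    model_add_timer(&head, &c);
    model_run_timer_list(&head, now);
    return model_fired_sum;
}
```

Because the list is sorted, the loop stops at the first timer that has not yet expired, which keeps the per-tick cost low when nothing is due.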


Scheduler Classes

Real-time (soft) Preemptive: rt_priority

SCHED_FIFO

– a process runs until it relinquishes control or a process with a higher rt_priority wishes to run

SCHED_RR

– can be interrupted if its time slice has expired and other processes with the same priority wish to run (round robin within the same class)

Classic SCHED_OTHER
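The three class names are also visible to user programs through the POSIX header sched.h. As a small sketch, assuming a POSIX system, the following queries how many priority levels a policy offers; model_priority_levels() is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sched.h>

/* Number of distinct static priorities for a policy, or -1 on error. */
int model_priority_levels(int policy)
{
    int lo, hi;

    assert(policy == SCHED_FIFO || policy == SCHED_RR || policy == SCHED_OTHER);
    lo = sched_get_priority_min(policy);
    hi = sched_get_priority_max(policy);
    if (lo == -1 || hi == -1)
        return -1;
    return hi - lo + 1;
}
```

POSIX requires at least 32 priority levels for SCHED_FIFO and SCHED_RR; the exact range is system dependent.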


Scheduler schedule() (kernel/sched.c, lines 283-407)

Called when

a system call blocks (indirectly, sleep_on -> schedule)

after a slow interrupt, ret_from_sys_call is called to check the need_resched flag

the timer interrupt will also set the need_resched flag

Major tasks

call routines that need to be run regularly

determine the process with the highest priority

make that process the current process



303-304: cannot be called within a nested interrupt

306-310: run the bottom halves of the interrupt routines (not time-critical), e.g., the timer interrupt.

312: routines registered to be run in the scheduler (chap. 7)

318-321: if the current process belongs to the SCHED_RR class and its time slice has expired, move it to the end of the run queue.

323-325: if the current process is in the TASK_INTERRUPTIBLE state and the signal it is waiting for has arrived, make it runnable again

326-333: if the current process is waiting for a timeout and the timeout has expired, make it runnable again



334-335: the current process must wait for an event, remove it from the run queue

357-364: looks for the process with the highest priority.

goodness (lines 235-281) return values

– -1000: don’t select this task

– 0: out of time (no results)

– +ve: the larger, the better

– +1000: real-time process

255-256: real-time process

265: simply use p->counter as its weight

277-278: a slight favor to the current process
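The selection rule can be sketched as follows. This is a simplified uniprocessor model, not the kernel source: struct model_task, model_goodness() and model_pick_next() are hypothetical names, and the -1000 case (task not selectable) is represented only by the starting value of the search.

```c
#include <assert.h>
#include <stddef.h>

enum model_policy { MODEL_SCHED_OTHER, MODEL_SCHED_FIFO, MODEL_SCHED_RR };

struct model_task {
    enum model_policy policy;
    long counter;       /* remaining ticks, used as dynamic priority */
    long rt_priority;   /* only meaningful for real-time policies */
};

/* Weight of task p, given that prev is the task currently running. */
int model_goodness(const struct model_task *p, const struct model_task *prev)
{
    long weight;

    assert(p != NULL);
    if (p->policy != MODEL_SCHED_OTHER)
        return (int)(1000 + p->rt_priority);  /* real-time always wins */
    weight = p->counter;                      /* 0 means the time slice is used up */
    if (weight != 0 && p == prev)
        weight += 1;                          /* slight favor to the current process */
    return (int)weight;
}

/* Index of the runnable task with the largest weight. */
int model_pick_next(const struct model_task *tasks, int n, int prev_idx)
{
    int best = -1, best_weight = -1000;

    assert(tasks != NULL && n > 0 && prev_idx >= 0 && prev_idx < n);
    for (int i = 0; i < n; i++) {
        int w = model_goodness(&tasks[i], &tasks[prev_idx]);
        if (w > best_weight) {
            best_weight = w;
            best = i;
        }
    }
    return best;
}
```

With two SCHED_OTHER tasks that both have counter 5, the one that is currently running gets weight 6 and keeps the processor, which avoids a needless context switch.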



367-370: if the counters of all runnable processes are 0, re-calculate the counters

386-401: make a new process the current process and do the context switch (switch_to())

switch_to() in include/asm-i386/system.h, lines 53-122

104-105: if next is the current task, do nothing

106-109: clear the TS flag if the task we switched to was the last to use the math co-processor

111-112: switch to the next task

114-120: reload the debug regs if necessary.
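The counter re-calculation at lines 367-370 can be illustrated numerically. The formula below, new counter = old counter / 2 + priority, is an assumption recalled from the 2.0 sources and is not stated on the slides; the model_* names are hypothetical.

```c
#include <assert.h>

/* Assumed re-calculation rule: halve what is left and add the static priority. */
long model_recalc_counter(long counter, long priority)
{
    assert(counter >= 0 && priority > 0);
    return (counter >> 1) + priority;
}

/* Counter of a task that sleeps through `rounds` re-calculations, starting at 0. */
long model_counter_after(int rounds, long priority)
{
    long counter = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++)
        counter = model_recalc_counter(counter, priority);
    return counter;
}
```

Under this rule a task with priority 20 that keeps sleeping climbs 20, 30, 35, 37, ... and stays below 40, so waiting tasks gain a bounded bonus over tasks that used up their time slice.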


System Call Flow

Setting up the IDT table

start_kernel() calls trap_init() (arch/i386/kernel/traps.c, 322)

After trap_init() has set up the system traps, it calls set_system_gate(0x80, &system_call). From then on IDT[0x80] points to system_call, so trap 0x80 invokes system_call.

A system call is therefore triggered by the int 0x80 instruction.


Defining the individual system calls

Taking fork() as an example, line 272 of include/asm-i386/unistd.h defines static inline _syscall0(int,fork)

_syscall0 is defined on line 174; it expands this declaration into

int fork(void)
{
    long __res;
    __asm__ volatile ("int $0x80" : "=a" (__res) : "0" (__NR_fork)); /* __NR_fork is 2 */
    if (__res >= 0)
        return (int) __res;
    errno = -__res;
    return -1;
}


Fork() System Call

So fork() relies on int $0x80 to raise a trap, passing __NR_fork as the input parameter and receiving __res as the output parameter. When the trap occurs, execution continues at system_call.
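The return convention of the wrapper (non-negative result passed through, negative result turned into errno and -1) can be sketched in portable C. This is an illustration assuming Linux with glibc; model_syscall_return() and raw_getpid() are hypothetical names, and syscall() is the C library's generic entry point rather than the int $0x80 macro shown on the slide.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors the tail of the wrapper: the kernel returns -errno on failure. */
long model_syscall_return(long res)
{
    assert(res > -4096);    /* assumed bound: kernel error codes are small negatives */
    if (res >= 0)
        return res;
    errno = (int)-res;
    return -1;
}

/* Invoke a system call by number, bypassing the getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

raw_getpid() should return the same value as getpid(), since both end up in the same kernel function selected by the system call number.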


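Executing system_call

This is at line 281 of arch/i386/kernel/entry.S.

On line 290, the passed-in parameter (the system call number) is used to look up the function in sys_call_table[] (e.g. sys_fork). If the entry is not null, then after the trace flag has been checked, the function (e.g. sys_fork) is called on line 304.

When the system call completes, control reaches line 322, ret_from_sys_call, which is also where a slow interrupt ends up after it finishes.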
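The table lookup can be modeled with an array of function pointers. A minimal sketch in C rather than assembly; the table contents and all model_* names are hypothetical, and the trace flag check is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*model_syscall_fn)(long arg);

static long model_sys_answer(long arg) { (void)arg; return 42; }
static long model_sys_double(long arg) { return 2 * arg; }

#define MODEL_NR_SYSCALLS 4

/* Indexed by system call number; NULL marks an unimplemented slot. */
static const model_syscall_fn model_sys_call_table[] = {
    NULL, model_sys_answer, model_sys_double, NULL
};

/* Dispatch: validate the number, look up the handler, call it. */
long model_system_call(unsigned long nr, long arg)
{
    assert(sizeof model_sys_call_table / sizeof model_sys_call_table[0]
           == MODEL_NR_SYSCALLS);
    if (nr >= MODEL_NR_SYSCALLS || model_sys_call_table[nr] == NULL)
        return -ENOSYS;             /* unknown or unimplemented call */
    return model_sys_call_table[nr](arg);
}
```

model_system_call(2, 21) dispatches to model_sys_double and returns 42, while an out-of-range or empty slot yields -ENOSYS without calling anything.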
