Showing posts with label kernel. Show all posts

Friday, May 7, 2021

Linux: What does IRQ save do in Linux?

Excerpt: Use local_irq_save to disable interrupts on the local processor and remember their previous state. The flags can be passed to local_irq_restore to restore the previous interrupt state.

void local_irq_save(unsigned long flags);
void local_irq_restore(unsigned long flags);

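The spinlock version, spin_lock_irqsave, also disables interrupts only on the local core. Protection against the other cores comes from the spinlock itself, not from disabling their interrupts.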


Thursday, April 29, 2021

Linux: Poking the ethernet driver with the ethtool

1. Get basic information about the interface

[mylinuxbox@mylinuxbox-linux ~]$ ethtool -i eth4
driver: e1000e
version: 2.1.4-k
firmware-version: 0.13-4
bus-info: 0000:00:19.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no


2. Dump all the hardware registers
[root@mylinuxbox-linux mylinuxbox]# ethtool -d eth4
MAC Registers
-------------
0x00000: CTRL (Device control register)  0x18100240
      Endian mode (buffers):             little
      Link reset:                        normal
      Set link up:                       1
      Invert Loss-Of-Signal:             no
      Receive flow control:              enabled
      Transmit flow control:             enabled
      VLAN mode:                         disabled
      Auto speed detect:                 disabled
      Speed select:                      1000Mb/s
      Force speed:                       no
      Force duplex:                      no
0x00008: STATUS (Device status register) 0x00080083
      Duplex:                            full
      Link up:                           link config
      TBI mode:                          disabled
      Link speed:                        1000Mb/s
      Bus type:                          PCI
      Bus speed:                         33MHz
      Bus width:                         32-bit
0x00100: RCTL (Receive control register) 0x04008002
      Receiver:                          enabled
      Store bad packets:                 disabled
      Unicast promiscuous:               disabled
      Multicast promiscuous:             disabled
      Long packet:                       disabled
      Descriptor minimum threshold size: 1/2
      Broadcast accept mode:             accept
      VLAN filter:                       disabled
      Canonical form indicator:          disabled
      Discard pause frames:              filtered
      Pass MAC control frames:           don't pass
      Receive buffer size:               2048
0x02808: RDLEN (Receive desc length)     0x00001000
0x02810: RDH   (Receive desc head)       0x00000051
0x02818: RDT   (Receive desc tail)       0x00000040
0x02820: RDTR  (Receive delay timer)     0x00000000
0x00400: TCTL (Transmit ctrl register)   0x3003F0FA
      Transmitter:                       enabled
      Pad short packets:                 enabled
      Software XOFF Transmission:        disabled
      Re-transmit on late collision:     disabled
0x03808: TDLEN (Transmit desc length)    0x00001000
0x03810: TDH   (Transmit desc head)      0x0000007A
0x03818: TDT   (Transmit desc tail)      0x0000007A
0x03820: TIDV  (Transmit delay timer)    0x00000008
PHY type:                                unknown

Monday, April 26, 2021

Linux: Nagle's Algorithm and How to Disable it

Nagle's algorithm is a TCP optimization in the kernel stack that holds back small chunks of data and aggregates them before sending packets on a TCP connection. This reduces the header overhead spent on sending many very small packets over the network. However, when the data is fairly sporadic, it can also increase the average delay experienced.


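Nagle's algorithm is controlled per socket rather than per host: an application disables it by setting the TCP_NODELAY option on the connection with setsockopt(). The tcp_low_latency sysctl is sometimes suggested for this, but it does not turn Nagle off. It only tuned how the receive path was processed, and kernels from 4.14 onward ignore its value:
echo 1 > /proc/sys/net/ipv4/tcp_low_latency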
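
For a single connection, an application can turn Nagle's algorithm off with the TCP_NODELAY socket option. The following is a minimal user-space sketch in Python; the helper names make_low_latency_socket and nagle_disabled are hypothetical and only serve the illustration.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the stack to send small writes immediately
    # instead of waiting to coalesce them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_low_latency_socket()
    print("Nagle disabled:", nagle_disabled(s))
    s.close()
```

The option applies only to the socket it is set on, so latency-sensitive connections can opt out while bulk transfers keep the default behaviour.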

Sunday, April 25, 2021

Linux: Kernel makefile system in brief

 There are multiple components in the kernel build system as shown in the table below:



Components in the Linux kernel build system

The top Makefile builds the kernel image and modules. It does this by visiting directories recursively based on the kernel .config. The top Makefile textually includes an arch Makefile with the name arch/$(ARCH)/Makefile. The arch Makefile supplies architecture-specific information to the top Makefile. Each directory has a kbuild makefile, which builds built-in or modular targets based on commands passed from above and information in the .config. Makefiles in the kernel are also referred to as kbuild files.

Further reading: A detailed discussion on these is available in the kernel documentation.

Thursday, June 4, 2020

Linux: Why is skb recycling done

Why is this done?
* Saves the cost of allocating and de-allocating memory repeatedly.
* Savings are significant because this is a very frequent operation (usually skb alloc and de-alloc is done on a per-packet basis).

Recent changes to SKB recycling:

"- Make skb recycling available to all drivers, without needing driver
  modifications.

- Allow recycling skbuffs in more cases, by having the recycle check
  in __kfree_skb() instead of in the ethernet driver transmit
  completion routine.  This also allows for example recycling locally
  destined skbuffs, instead of only recycling forwarded skbuffs as
  the transmit completion-time check does.

- Allow more consumers of skbuffs in the system use recycled skbuffs,
  and not just the rx refill process in the driver."

Wednesday, June 3, 2020

Linux: Kernel crash debugging: BUG: scheduling while atomic

Why do you see this message?
"Scheduling while atomic" indicates that you've tried to sleep somewhere that you shouldn't - like within a spinlock-protected critical section or an interrupt handler.

Things to check:

1. Check whether some code path returns without releasing a lock, or whether some part of the code actually sleeps while in atomic context.

2. Another error that may be printed during such a crash is:
BUG: workqueue leaked lock or atomic
This indicates that a lock was not released, which is typically caused by returning from a routine before the lock is released.

Saturday, May 2, 2020

Linux: Kernel Stack Corruption

What do you do when you see a crash trace like this one?

-- snip --
task: ffffffc07d4d8000 ti: ffffffc07d4d4000 task.ti
PC is at 0xc72kf
LR is at 0xc72kf
pc : [<00000000000c70c0>] lr : [<00000000000c70c0>]
sp : ffffffc07d4d7920
x29: ffffffc07d4d7720 x28: ffffffc07d4d7a58
x27: 0000000000000001 x26: ffffffc07d477780
x25: 0000000000000001 x24: ffffffc00088f000
--snip--
 [<00000000000c70c0>] (suspected corrupt symbol)

Saturday, April 18, 2020

Linux: Why do we need an idle task?

There are two main reasons for having the idle task in the Linux kernel design:
  • Historical reason: In the older days, CPUs were not capable of idling in a lower power state, so they would constantly execute the no-operation (nop[1]) instruction. Now, with the advances in the CPU design, most processors will leverage a variant of the halt (HLT[2]) instruction that will allow them to go to a lower power mode.
  • Convenience: Instead of coding up and handling a special case for what to do when none of the processes are runnable i.e. the run queue is empty, we instead schedule the idle task, which is always runnable.

Monday, April 6, 2020

Linux: No entries seen in /proc/modules on loading new kernel

Issue: No entries are seen in /proc/modules on loading a new kernel (typically built from scratch [1]).
Solution: The solution is pretty simple. This happens because the kernel has not loaded the module for which you would like to see the symbols. You can confirm this by running:
$ lsmod > modulelist.txt
$ cat modulelist.txt
If you do not see your module here, set the following while rebuilding your kernel:
make LSMOD="modulelist.txt" localmodconfig
After rebuilding with this setting, reboot into your new kernel and your module should get loaded.



[1] Quick ref on rebuilding kernel from source

Sunday, January 5, 2020

Linux: Why is the buddy system needed? - To prevent fragmentation

The buddy system is a mechanism for page management in Linux. It is needed to make sure that the free memory does not get fragmented and unusable. For an overview of the buddy system including a simple example of how it works, see this page [2]. From the same page, "In comparison to other simpler techniques such as dynamic allocation, the buddy memory system has little external fragmentation, and allows for compaction of memory with little overhead. The buddy method of freeing memory is fast, with the maximal number of compactions required equal to log2(highest order). Typically the buddy memory allocation system is implemented with the use of a binary tree to represent used or unused split memory blocks. The "buddy" of each block can be found with an exclusive OR of the block's address and the block's size."
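
The exclusive-OR trick from the quote can be sketched in a few lines. This is a user-space illustration, not kernel code: buddy_of and merged_block are hypothetical names, and offsets are assumed to be relative to the start of the managed region and aligned to the block size.

```python
def buddy_of(block_offset: int, block_size: int) -> int:
    """Return the offset of a block's buddy (hypothetical helper).

    block_size must be a power of two and block_offset a multiple of it.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    if block_offset % block_size:
        raise ValueError("block_offset must be aligned to block_size")
    # Flipping the bit that corresponds to the block size selects
    # the other half of the parent block.
    return block_offset ^ block_size


def merged_block(block_offset: int, block_size: int) -> int:
    """Offset of the parent block formed when a block and its buddy coalesce."""
    # Clearing the size bit yields the lower of the two buddies,
    # which is where the merged block starts.
    return block_offset & ~block_size


if __name__ == "__main__":
    print(hex(buddy_of(0x3000, 0x1000)))
    print(hex(merged_block(0x3000, 0x1000)))
```

Because the buddy is found with one XOR, freeing a block only needs to check whether that single neighbour is free before merging and repeating at the next order.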

An alternative to the buddy system would be to use the memory management unit (MMU) support to rewire or re-arrange blobs of free pages together to construct larger contiguous pages. However, this will not work for DMA systems which bypass the MMU. Also, modifying the virtual address on a continual basis would make the paging process slow.

The buddy system can be debugged by printing its current statistics, which the kernel exposes in the /proc/buddyinfo file. As described in the guide from centos.org, this file helps in debugging fragmentation issues. A sample output from the same site is as shown below:
cat /proc/buddyinfo
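
Each row of /proc/buddyinfo lists, per node and zone, the number of free blocks at each order, where a block of order n spans 2^n pages. The sketch below parses one such row and totals the free pages. The function names are hypothetical, and SAMPLE_LINE uses made-up numbers purely for illustration; it is not output from a real machine.

```python
# Illustrative row only: the counts are invented for this example.
SAMPLE_LINE = "Node 0, zone   Normal   4   2   1   0   0   0   0   0   0   0   1"


def parse_buddyinfo_line(line: str):
    """Split one buddyinfo row into (node, zone, counts per order)."""
    fields = line.split()
    # Expected layout: Node <n>, zone <name> <order-0 count> <order-1 count> ...
    if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
        raise ValueError("unexpected buddyinfo line: %r" % line)
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(value) for value in fields[4:]]
    return node, zone, counts


def free_pages(counts) -> int:
    """Total free pages: each order-n block contributes 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


if __name__ == "__main__":
    node, zone, counts = parse_buddyinfo_line(SAMPLE_LINE)
    print(node, zone, counts, free_pages(counts))
```

Many free pages concentrated in the low orders with zeros in the high orders is the usual sign of fragmentation: memory is free, but large contiguous requests can still fail.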

Different ways to print Linux kernel symbols instead of addresses

Cheatsheet:

%pF versatile_init+0x0/0x110
%pf versatile_init
%pS versatile_init+0x0/0x110
%pSR versatile_init+0x9/0x110
(with __builtin_extract_return_addr() translation)
%ps versatile_init
%pB prev_fn_of_versatile_init+0x88/0x88

Monday, October 28, 2019

How to use IS_ERR and PTR_ERR? What do they mean?

From the kernel definition there are three macros:
  1. IS_ERR - used to check. Returns a non-zero value if ptr is an error pointer, and 0 otherwise.
  2. PTR_ERR - used to print. Converts the error pointer into the error code it encodes (a negative errno value, as a long).
  3. IS_ERR_VALUE - is explained in a little more detail here [1].
I find these the most useful for kernel-space programming. If ptr is the pointer you want to check, use them as follows:
if (IS_ERR(ptr))
     printk(KERN_ERR "Error here: %ld\n", PTR_ERR(ptr));

Tuesday, October 22, 2019

Linux: Solution Unknown symbol “__aeabi_ldivmod”

You might notice that your kernel module compiles fine for a 32-bit platform. However, inserting it fails (either at boot or when explicitly running insmod or modprobe).

The "Unknown symbol in module" error is seen because some statement in the module performs a 64-bit division in the Linux kernel on a 32-bit ARM platform.

Why is the symbol missing if everything compiles? A 32-bit ARM CPU has no native 64-bit divide, so the compiler emits a call to a slow library helper (__aeabi_ldivmod) which Linux does not implement. The module builds, but when it is loaded the symbol cannot be resolved.

The solution to this problem is to include <asm/div64.h> and use do_div(), which divides a 64-bit unsigned value by a 32-bit divisor in place and returns the remainder.

Monday, October 21, 2019

Linux: Solution Fatal section header offset is bigger than file size

The error I was seeing while recompiling a driver:
fatal section header offset 32425246532452 in file 'vmlinux' is bigger than filesize=35524847

What helped was trying from scratch:
sudo make distclean
make menuconfig
make modules
sudo make modules_install

Friday, August 30, 2019

Which Linux Kernel timing or delay APIs to use for what?

I classify the waiting or timing APIs in the Linux kernel into two categories:
1. Those that block the current thread of execution.
2. Those that schedule something for later while the current thread continues.
In most cases, the distinction between 1 and 2 is clear, but the techniques used to implement 2 can also be manipulated to behave like 1.


1. Blocking current thread of execution (Inline delays)
The APIs for category 1 above are:

Wednesday, August 28, 2019

Why Linux Kernel KASLR is not very effective

Recently, with more time on hand, I have been reading about security in the Linux kernel. A common mode of attack on any program is using a buffer overflow to implement return oriented programming (ROP). Return oriented programming works by overwriting return addresses on the stack so that execution chains together short snippets of existing code (gadgets), for example in a library, to perform the desired functionality.

Tuesday, June 25, 2019

Linux: How does the kernel invoke the correct driver for the corresponding /dev file ops?

To answer this question, I am picking snippets of texts from different blogs that I read online.
"When devfs is not being used, adding a new driver to the system means assigning a major number to it. The assignment should be made at driver (module) initialization by calling the following function, defined in <linux/fs.h>:
int register_chrdev(unsigned int major, const char *name,
struct file_operations *fops);
The return value indicates success or failure of the operation. A negative return code signals an error; a 0 or positive return code reports successful completion. The major argument is the major number being requested, name is the name of your device, which will appear in /proc/devices, and fops is the pointer to an array of function pointers" [1]
Thus, this will result in the registration of the device driver with a major number with the kernel. 
"Once the driver has been registered in the kernel table, its operations are associated with the given major number. Whenever an operation is performed on a character device file associated with that major number, the kernel finds and invokes the proper function  from the  file_operations structure. For this reason, the pointer passed to register_chrdev should point to a global structure within the driver, not to one local to the module’s initialization function."

Sunday, May 19, 2019

Linux: When to use kmalloc vs kmem_cache_alloc vs Vmalloc?

Read the quick comparison here: 
Linux Memory Management API Quick Primer  [Download PDF]
Gautam Bhanage | www.Bhanage.com | Pub2: GDB2019-001 June 2019

At a high level:
1. kmalloc for generic allocations of physically contiguous memory.
2. kmem_cache_alloc for structs of the same type that are allocated and freed repeatedly. These structures typically need to be accessed frequently (and can be aligned to hardware cache lines by the kernel). This is done much more efficiently through the slab allocator.
3. vmalloc() allocates memory that is virtually contiguous but not necessarily physically contiguous. This makes it a poor fit for most Linux kernel driver buffers, for example those used for DMA.

Wednesday, April 24, 2019

Linux: Why Deleting Files Does Not Free Up Memory

This is a very old blog post from a place I contributed to:

You are in the middle of compiling your code and you see errors:
/disk/user1/platform_dev/driver/openwrt/staging_dir_arm_platform/bin/arm-unknown-linux-uclibcgnueabi-objcopy:/disk/user1/platform_dev/driver/base/build_platform/kmod/linuxmodule/stI4AuYi: No space left on device
make[5]: *** Waiting for unfinished jobs....

So this is just like old times. You run df -h and check if you are running at disk capacity.
I find that I am at 70% disk usage but still the make process is complaining.

I decide to be conservative and free up more space on the system.
Then I see something interesting. Despite doing the rm -Rf on a couple of big chunks of data,
I do not see the free space on my system going up. 

That gets me thinking about why this might be happening. The reason is that the deletes I did were through screen, and Linux will not free a file's inode (and the disk blocks behind it) as long as some process still refers to it, for example through an open file descriptor.
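
The behaviour is easy to reproduce from user space. The sketch below, written for Linux with a hypothetical helper name, deletes a temporary file while it is still open and shows that the name is gone while the data stays readable until the descriptor is closed.

```python
import os
import tempfile


def unlink_while_open(payload: bytes):
    """Delete a file that is still open (hypothetical helper).

    Returns (name_still_exists, link_count, data_read_back).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                    # like rm: removes the directory entry
        name_exists = os.path.exists(path)
        links = os.fstat(fd).st_nlink      # no names left, but the inode lives on
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))   # contents are still reachable
        return name_exists, links, data
    finally:
        os.close(fd)                       # only now can the space be reclaimed


if __name__ == "__main__":
    exists, links, data = unlink_while_open(b"x" * 4096)
    print(exists, links, len(data))
```

On a live system, running lsof +L1 lists open files whose link count has dropped below one, which helps find the process that is still holding the deleted data.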

Monday, February 20, 2017

What datastructure does the CFS use and why

CFS is the Linux kernel's Completely Fair Scheduler. It uses red-black (RB) trees.

Red Black trees - datastructure
Nice notes on understanding red-black trees are here [1] and [2].

Why and How is it used in the scheduler?