oar-p2p

	Commit message	Author	Age	Files	Lines
*	chore: bump version to 0.2.25 (HEAD, v0.2.25, main)	diogo464	2026-03-17	2	-2/+2
\|
*	fix: CLUSTER_USERNAME when running from frontend	diogo464	2026-03-17	1	-0/+4
\| \| \| \| \|	the CLUSTER_USERNAME variable was not being used when running from the frontend.
*	fix: namespace tmp directory	diogo464	2026-03-17	1	-12/+13
\| \| \| \| \| \| \|	since the /tmp directory is shared by everyone, some leftover files remain. when a different user attempts to remove or write to those files without having permission, it causes failures. this commit uses /tmp/$USER instead of using /tmp directly.
*	added default justfile recipe	diogo464	2026-03-17	1	-0/+3
\|
*	cargo fmt	diogo464	2026-02-11	3	-18/+17
\|
*	Merge pull request #1 from jbordalo/main	diogo464	2026-02-11	3	-2/+23
\|\ \| \| \| \|	Add support for differing username between cluster and local machine
\| *	Add support for differing username between cluster and local machine	jbordalo	2026-02-11	3	-2/+23
\|/
*	chore: bump version to 0.2.24 (v0.2.24)	diogo464	2025-11-10	2	-2/+2
\|
*	feat: added --version flag	diogo464	2025-11-09	2	-0/+14
\|
*	fix: added --force flag to docker swarm leave	diogo464	2025-11-09	1	-1/+1
\|
*	chore: bump version to 0.2.23 (v0.2.23)	diogo464	2025-10-30	2	-2/+2
\|
*	fixed clippy warnings	diogo464	2025-10-30	1	-15/+9
\|
*	warn if running from job machine	diogo464	2025-10-30	1	-2/+13
\|
*	added support for the shelder machine	diogo464	2025-10-30	1	-1/+1
\|
*	added support for snorlax machines	diogo464	2025-10-30	1	-1/+4
\|
*	chore: bump version to 0.2.22 (v0.2.22)	diogo464	2025-10-30	2	-2/+2
\|
*	docs: add rust nightly documentation links	diogo464	2025-10-30	1	-2/+2
\| \| \| \| \| \| \| \|	Add links to Rust nightly documentation for installation prerequisites to help users understand how to install and configure the nightly toolchain. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
*	fix: compile warning	diogo464	2025-10-30	1	-0/+1
\|
*	fix: improve error message when job has 0 machines	diogo464	2025-10-30	2	-0/+11
\| \| \| \| \| \| \| \|	when a job is not yet in the running state, the program still runs but the function that lists the job's machines returns zero machines. this can cause a division-by-zero panic, which is not a very helpful message. this commit improves the error/warn messages when no machines are listed for a job.
*	fix: dont panic when unable to obtain hostname	diogo464	2025-10-30	2	-9/+13
\| \| \| \| \| \| \| \| \|	if we are unable to obtain the hostname of the local machine from the /etc/hostname file or HOSTNAME env var then an empty string is used as the hostname and a warning is shown instead of a panic. since the hostname is just used to determine if we are executing on the cluster, and the cluster machines all have their hostnames set, then using the empty string should not be a problem.
*	chore: bump version to 0.2.21 (v0.2.21)	diogo464	2025-10-17	2	-2/+2
\|
*	remove docker networks and leave swarm on setup	diogo464	2025-10-17	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	some docker networks and the swarm can create routing rules that conflict with the ones we create leading to errors such as: 59:2025-10-15T15:07:33.745853Z WARN Failed to connect to 10.0.1.243:4000: No route to host (os error 113) 89:2025-10-15T15:08:30.197646Z WARN Failed to connect to 10.0.1.32:4000: No route to host (os error 113) 92:2025-10-15T15:08:30.837360Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 95:2025-10-15T15:08:33.905356Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 98:2025-10-15T15:08:36.981419Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 101:2025-10-15T15:08:40.049335Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 104:2025-10-15T15:08:43.121680Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 107:2025-10-15T15:08:46.197394Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 110:2025-10-15T15:08:49.265514Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 113:2025-10-15T15:08:52.337454Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 116:2025-10-15T15:08:55.409444Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 119:2025-10-15T15:08:58.481471Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113) 1
*	increased ssh connection attempts to 10 and increased verbosity	diogo464	2025-10-14	1	-1/+2
\|
*	added extra debug logging when reading/parsing latency matrix	diogo464	2025-10-14	1	-0/+8
\|
*	fixed some README examples	diogo464	2025-09-18	1	-5/+9
\|
*	chore: bump version to 0.2.20 (v0.2.20)	diogo464	2025-09-18	2	-2/+2
\|
*	feat: write timestamp to signal file	diogo464	2025-09-18	1	-4/+22
\|
*	fix: trim hostname to remove newlines	diogo464	2025-09-18	1	-0/+1
\|
*	remove existing containers on net up/down	diogo464	2025-08-23	1	-5/+10
\|
*	chore: bump version to 0.2.19 (v0.2.19)	diogo464	2025-08-19	2	-2/+2
\|
*	fixed repeated docker pull	diogo464	2025-08-18	1	-2/+10
\|
*	chore: bump version to 0.2.18 (v0.2.18)	diogo464	2025-08-18	2	-2/+2
\|
*	pull container image only once per machine	diogo464	2025-08-18	1	-1/+3
\| \| \| \| \| \|	the container image is now only pulled once, before all containers are created. this prevents hitting the container registry with thousands of requests in a very short amount of time.
*	chore: bump version to 0.2.17 (v0.2.17)	diogo464	2025-08-18	2	-2/+2
\|
*	added the --matrix-wrap flag	diogo464	2025-08-18	1	-9/+24
\| \| \| \| \| \|	I got to the point of needing more than 10k rows, and since my biggest latency matrix is 10k in size, this option allows the values to wrap so we can create a network bigger than the matrix.
*	only consider running oar jobs when listing them	diogo464	2025-08-17	1	-1/+10
\|
*	added clean command	diogo464	2025-08-11	1	-0/+20
\|
*	chore: bump version to 0.2.16 (v0.2.16)	diogo464	2025-08-08	2	-2/+2
\|
*	improved container wait reliability with timeouts/retries	diogo464	2025-08-08	1	-3/+9
\|
*	doubled tcp max orphan limit	diogo464	2025-08-08	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	the default value on the machines seems to be 262144 but on some larger experiments dmesg will sometimes show the following logs: [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets [Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets hopefully increasing this limit will fix that. https://serverfault.com/questions/624911/what-does-tcp-too-many-orphaned-sockets-mean the second answer on Server Fault also says it could be due to tcp memory limits: ``` The possible cause of this error is system run out of socket memory.Either you need to increase the socket memory(net.ipv4.tcp_mem) or find out the cause of memory consumption [root@test ~]# cat /proc/sys/net/ipv4/tcp_mem 362688 483584 725376 So here in my system you can see 725376(pages)*4096=2971140096bytes/1024*1024=708 megabyte So this 708 megabyte of memory is used by application for sending and receiving data as well as utilized by my loopback interface.If at any stage this value reached no further socket can be made until this memory is released from the application which are holding socket open which you can determine using netstat -antulp. ``` but for now I will just increase the max orphans and see if that is enough.
*	only print last 500 lines of logs on container failure	diogo464	2025-08-08	1	-1/+1
\|
*	chore: bump version to 0.2.15 (v0.2.15)	diogo464	2025-08-07	2	-2/+2
\|
*	fix: increase arp cache table size	diogo464	2025-08-07	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	dmesg was showing these messages: [Thu Aug 7 14:05:26 2025] net_ratelimit: 4328 callbacks suppressed [Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow! [Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow! [Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow! [Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow! and the machines were becoming inaccessible. increasing the arp cache size fixes this.
*	chore: bump version to 0.2.14 (v0.2.14)	diogo464	2025-08-07	2	-2/+2
\|
*	disabled conntrack on 10.0.0.0/8 packets	diogo464	2025-08-07	1	-0/+15
\| \| \| \| \| \| \| \| \|	we were hitting conntrack limits when opening lots of connections and sending UDP packets to many different hosts. this resulted in TCP packets getting dropped, which would manifest itself as connection errors or timeouts, and sending UDP packets using `sendto` would fail with a permission denied error. disabling conntrack fixes all of these problems.
*	fixed dmesg logs from tc	diogo464	2025-08-07	1	-1/+1
\| \| \| \| \| \|	there were messages similar to: HTB: quantum of class 10020 is small. Consider r2q change. that showed up when bringing up the network. this commit fixes that.
*	chore: bump version to 0.2.13 (v0.2.13)	diogo464	2025-08-02	2	-2/+2
\|
*	added --interleave flag to oar-p2p net show	diogo464	2025-08-02	1	-2/+23
\|
*	chore: bump version to 0.2.12 (v0.2.12)	diogo464	2025-07-24	2	-2/+2
\|
*	added basic retry logic to the machine_containers_wait function	diogo464	2025-07-24	1	-1/+17
\|
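
The `--matrix-wrap` commit in this log says the flag lets latency values wrap so a network can be larger than the matrix. One way such wrapping could work is to take both node indices modulo the matrix size, sketched below. The function name `wrapped_latency`, the `Vec<Vec<f64>>` representation, and the modulo scheme are assumptions for illustration; they are not taken from the oar-p2p source.

```rust
/// Hypothetical helper (not from oar-p2p): look up the latency between two
/// node indices, wrapping both indices around the matrix size so that a
/// network may contain more nodes than the matrix has rows.
///
/// Returns `None` if the matrix is empty or the selected row is too short.
fn wrapped_latency(matrix: &[Vec<f64>], src: usize, dst: usize) -> Option<f64> {
    let n = matrix.len();
    if n == 0 {
        // No rows to wrap around; also avoids a modulo-by-zero panic.
        return None;
    }
    // Assumed wrapping rule: node i uses row i % n, node j uses column j % n.
    let row = matrix.get(src % n)?;
    row.get(dst % n).copied()
}
```

Under this assumed scheme, with a 2x2 matrix, nodes 2 and 3 reuse the entries for nodes 0 and 1. Two distinct nodes whose indices are congruent modulo the matrix size land on the diagonal, so a real implementation would have to decide what latency to use in that case.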