Commit message | Author | Age | Files | Lines
the CLUSTER_USERNAME variable was not being used when running from the
frontend.
since the /tmp directory is shared by everyone, some leftover files remain. when a different user attempts to remove or write to those files but does not have permission, it causes failures. this commit just uses /tmp/$USER instead of using /tmp directly.
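a minimal sketch of the per-user temp directory scheme; the `chmod` and the `id -un` fallback are assumptions, not in the commit:

```shell
# Use a per-user subdirectory instead of /tmp directly.
TMPDIR="/tmp/${USER:-$(id -un)}"   # fall back to id -un if USER is unset
mkdir -p "$TMPDIR"                 # create if missing; no error if it exists
chmod 700 "$TMPDIR"                # keep other users out (an assumption)
echo "$TMPDIR"
```

with this, two users never touch each other's files, so the permission failures described above cannot occur.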
when a job is not yet in the running state, the program runs but the function that lists the job's machines returns zero machines. this can cause a division-by-zero panic, whose message is not very helpful. this commit improves the error/warn messages shown when no machines are listed for a job.
if we are unable to obtain the hostname of the local machine from the /etc/hostname file or the HOSTNAME env var, an empty string is used as the hostname and a warning is shown instead of panicking. since the hostname is only used to determine whether we are executing on the cluster, and the cluster machines all have their hostnames set, using the empty string should not be a problem.
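the fallback chain can be sketched in shell (the function name is hypothetical; the real code presumably does the equivalent):

```shell
# Try /etc/hostname, then $HOSTNAME, then fall back to "" with a warning.
get_hostname() {
  if [ -r /etc/hostname ] && [ -s /etc/hostname ]; then
    cat /etc/hostname
  elif [ -n "${HOSTNAME:-}" ]; then
    printf '%s\n' "$HOSTNAME"
  else
    echo "warn: could not determine hostname, using empty string" >&2
    printf '\n'
  fi
}

host="$(get_hostname)"
```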
some docker networks and the swarm can create routing rules that conflict with the ones we create, leading to errors such as:
2025-10-15T15:07:33.745853Z WARN Failed to connect to 10.0.1.243:4000: No route to host (os error 113)
2025-10-15T15:08:30.197646Z WARN Failed to connect to 10.0.1.32:4000: No route to host (os error 113)
2025-10-15T15:08:30.837360Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:33.905356Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:36.981419Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:40.049335Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:43.121680Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:46.197394Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:49.265514Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:52.337454Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:55.409444Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
2025-10-15T15:08:58.481471Z WARN Failed to connect to 10.0.1.91:4000: No route to host (os error 113)
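a few diagnostic commands that can reveal such conflicts (the network name `bridge` is just an example; requires a Docker daemon, so this is a sketch, not something to run blindly):

```shell
# Show kernel routes that overlap the 10.0.0.0/8 range we use.
ip route show to match 10.0.0.0/8

# List Docker networks, then check a network's subnet pool for overlaps.
docker network ls
docker network inspect -f '{{range .IPAM.Config}}{{.Subnet}}{{end}}' bridge
```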
the container image is now pulled only once, before all containers are created. this prevents hitting the container registry with thousands of requests in a very short amount of time.
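the idea can be sketched as follows; `docker` is stubbed here so the sketch runs without a daemon (drop the stub in real use), and the image and container names are hypothetical:

```shell
# Stub so the sketch is runnable anywhere; remove for real use.
docker() { echo "docker $*"; }

IMAGE="app:latest"            # hypothetical image name
docker pull "$IMAGE"          # a single registry request, up front
for i in 1 2 3; do
  docker create --name "node-$i" "$IMAGE"   # creates reuse the local image
done
```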
I got to the point of needing more than 10k rows, and since my biggest latency matrix is 10k in size, this option allows the values to wrap so we can create a network bigger than the matrix.
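wrapping presumably just indexes the matrix modulo its size; in shell arithmetic:

```shell
N=10000               # latency-matrix dimension (10k, as above)
node=12345            # a node index beyond the matrix size
row=$(( node % N ))   # wraps back into [0, N)
echo "$row"           # prints 2345
```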
|
| | |
|
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
the default value on the machines seems to be 262144, but on some larger experiments dmesg will sometimes show the following logs:
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
[Fri Aug 8 05:01:42 2025] TCP: too many orphaned sockets
hopefully increasing this limit will fix that.
https://serverfault.com/questions/624911/what-does-tcp-too-many-orphaned-sockets-mean
the second answer on Server Fault also says it could be due to TCP memory limits:
```
The possible cause of this error is system run out of socket memory. Either you need to increase the socket memory (net.ipv4.tcp_mem) or find out the cause of memory consumption.
[root@test ~]# cat /proc/sys/net/ipv4/tcp_mem
362688 483584 725376
So here in my system you can see 725376 (pages) * 4096 = 2971140096 bytes / 1024*1024 = 708 megabyte
So this 708 megabyte of memory is used by application for sending and receiving data as well as utilized by my loopback interface. If at any stage this value is reached, no further socket can be made until this memory is released from the application which are holding socket open, which you can determine using netstat -antulp.
```
but for now I will just increase the max orphans and see if that is enough.
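the sysctl change might look like this (needs root; the new value is an assumption, since the commit only says the limit is increased):

```shell
# Check the current limit (default observed on the machines: 262144).
cat /proc/sys/net/ipv4/tcp_max_orphans

# Raise it; 1048576 is an illustrative value, not from the commit.
sysctl -w net.ipv4.tcp_max_orphans=1048576
```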
dmesg was showing these messages:
[Thu Aug 7 14:05:26 2025] net_ratelimit: 4328 callbacks suppressed
[Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow!
[Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow!
[Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow!
[Thu Aug 7 14:05:26 2025] neighbour: arp_cache: neighbor table overflow!
and the machines were becoming inaccessible. increasing the arp cache size fixes this.
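increasing the neighbor-table size is typically done via the gc_thresh sysctls (needs root; the values below are illustrative, not taken from the commit):

```shell
# gc_thresh3 is the hard cap on neighbor-table entries;
# gc_thresh1/2 control when garbage collection kicks in.
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768
```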
we were hitting conntrack limits when opening lots of connections and sending UDP packets to many different hosts. this resulted in TCP packets getting dropped, which would manifest itself as errors or timeouts when connecting, and sending UDP packets using `sendto` would fail with a permission-denied error. disabling conntrack fixes all of these problems.
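one common way to disable conntrack is NOTRACK rules in the raw table (needs root; a sketch, since the commit does not say which mechanism it uses):

```shell
# Skip connection tracking for all traffic, inbound and outbound.
iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT     -j NOTRACK

# Alternative: raise the table limit instead of disabling tracking.
sysctl -w net.netfilter.nf_conntrack_max=1048576
```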
there were messages similar to:
HTB: quantum of class 10020 is small. Consider r2q change.
that showed up when bringing up the network. this commit fixes that.
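HTB derives each class's quantum as rate / r2q, and warns when that lands outside a sane range, so the fix is presumably passing an explicit r2q (or per-class quantum) when creating the qdisc. the device and value below are assumptions:

```shell
# Needs root; eth0 and r2q 1000 are illustrative.
tc qdisc add dev eth0 root handle 1: htb default 10 r2q 1000
```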
the env var OAR_P2P_CONCURRENCY_LIMIT limits the number of parallel "operations" being done on the cluster machines. so, if it is set to 3, then we only work on 3 machines at a time. setting it to 0 means unlimited.
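usage is the usual env-var form (the tool invocation itself is not shown in the commit):

```shell
# Work on at most 3 cluster machines at a time.
export OAR_P2P_CONCURRENCY_LIMIT=3

# Unlimited parallelism.
export OAR_P2P_CONCURRENCY_LIMIT=0
```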
currently the shell script used to list the 10.0.0.0/8 range of addresses on a machine would fail with exit code 1 if no addresses were present in that range (i.e. grep did not match anything). this fix just makes sure that command always returns exit code 0.
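the usual fix is appending `|| true` so a non-matching grep does not propagate exit code 1; a sketch, with the exact `ip`/`grep` invocation assumed:

```shell
# List IPv4 addresses in 10.0.0.0/8; exits 0 even when none match.
list_10net() {
  ip -4 -o addr show 2>/dev/null | awk '{print $4}' | grep '^10\.' || true
}

addrs="$(list_10net)"
echo "found: ${addrs:-none}"
```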
- Add generate-schedule.sh script to create container schedules from addresses.txt
- Add benchmark-startup Python script for analyzing container startup times
- Update demo.sh to print timestamps and wait for start signal at /oar-p2p/start
- Add comprehensive statistics including startup, start signal, and waiting times
- Support for synchronized container coordination via start signal file
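the start-signal wait in demo.sh presumably boils down to a poll loop like this (the function name is hypothetical; /oar-p2p/start is the path from the commit):

```shell
# Block until the start-signal file appears, then print a timestamp.
wait_for_start() {
  until [ -e "$1" ]; do
    sleep 0.1
  done
  date -u +"%Y-%m-%dT%H:%M:%SZ"   # time at which the signal was seen
}

# In the containers this would be: wait_for_start /oar-p2p/start
```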
Remove Rust-related files (Cargo.toml, Cargo.lock, src/, target/) and restructure as Python project using uv for dependency management. Update project structure to match nova-oar-mcp style with pyproject.toml, .python-version, and proper Python packaging conventions.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>