vsoch commented on issue rootless-containers/usernetes#242.
I made some progress in the 11 hours since I posted, and I think now the issue is in the space of nvidia. What is almost working is to set no-cgroups = true in the nvidia container runtime config.toml, but then there are issues with containerd on the kubelet. I posted more here: https://github.com/NVIDIA/nvidia-container-toolkit/issues/56#issuecomment-2673830806…
vsoch pushed to hpc-social/jobs
run jobs updater once a day
vsoch commented on issue NVIDIA/nvidia-container-toolkit#56.
@elezar it looks like both the PRs you mentioned have been merged - I’ve been trying to get GPUs working in userspace kubernetes, which means rootless docker running on the host, and then inside the container provided by rootless docker we have kubernetes components. I was able to set no-cgroups = true in the host nvidia container runtime config and the container built and started, but then I get an error from the kubelet when trying to install / run the gpu operator via helm (and I suspect I’d hit this for other cases too):…
vsoch commented on issue singularityhub/singularity-hpc#688.
Thanks, @dipietrantonio. We can keep this open to follow up if a pin to jsonschema is needed….
vsoch commented on issue singularityhub/singularity-hpc#688.
shpc does not use rpds directly (and I’m not sure what it is). From the trace it looks like it’s required for jsonschema, so perhaps there was an update to that library that triggered this? I think the bug report would go there, and we might just pin jsonschema to a previously working version in the meantime (while it isn’t fixed)….
vsoch opened a pull request to converged-computing/flux-usernetes
vsoch closed a pull request to flux-framework/spack
vsoch pushed to flux-framework/spack
apptainer/singularity/singularityCE: variant suid default False (#49088)
vsoch pushed to flux-framework/spack
lis: add v2.0.28 -> v2.1.7 (#48308)
- Added LIS 2.1.7
- Added LIS versions from 2.0.28 to 2.1.7
vsoch opened a pull request to converged-computing/flux-tutorials
vsoch pushed to rseng/software
Merge pull request #410 from rseng/update/software-2025-02-16
Update from update/software-2025-02-16
vsoch pushed to flux-framework/flux-go
Merge pull request #3 from flux-framework/refactor
wip: refactor of flux-go
vsoch opened a pull request to flux-framework/flux-go
vsoch pushed to converged-computing/state-machine-operator
Merge pull request #4 from converged-computing/refactor-generic
refactor: trackers are generic
vsoch pushed to flux-framework/fluxion-go
bug: find can return > 0, should not be called an error
This is due to a call to check_array_size that returns an rc, but the value is not actually a return code; it is the sum of elements in the array. The function can also return a negative value when things go wrong, but a successful call that returns a positive, nonzero value can look like an error. For the time being I am tweaking our reapi error parser to treat anything >= 0 as not an error (nil).
Signed-off-by: vsoch vsoch@users.noreply.github.com
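The rule described in this commit message (a negative value is a failure, zero or any positive value is not) can be sketched in Go as below. This is a minimal illustration and not the actual fluxion-go code: the function name `errorFromCode` and the error text are assumptions made for the example.

```go
package main

import "fmt"

// errorFromCode is a hypothetical stand-in for the reapi error parser.
// Negative values are treated as failures. Zero and positive values
// (for example, a sum of array elements) are treated as success.
func errorFromCode(rc int) error {
	if rc >= 0 {
		return nil
	}
	return fmt.Errorf("reapi call failed with code %d", rc)
}

func main() {
	// Try a zero, a positive, and a negative value.
	for _, rc := range []int{0, 3, -1} {
		fmt.Printf("rc=%d err=%v\n", rc, errorFromCode(rc))
	}
}
```

The design choice here is to keep the check in one place, so the rule can be tightened later if the underlying call starts returning a real return code.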
vsoch commented on issue eksctl-io/eksctl#8220.
Sounds good - thanks @bryantbiggs ! And it could be this allocation was special for us (I don’t know the details) and this issue is extremely unlikely to happen for anyone else. If that’s the case, then the custom build is probably our best bet, and we can pray to the GPU gods that circumstances change in the future to make them easier to get! :laughing: …
vsoch pushed to researchapps/eksctl
bug: placement group should not be added for reservation
Reservations are special cases: they are often created with placement already in mind, or have instances in different availability zones (or far enough apart), so adding a placement group automatically will prevent provisioning a large set of resources. Changing the default behavior to always require the user to specify a placement group for EFA is overkill. A good balance is this: when EFA is enabled, there is no placement group, and there is a reservation, do not add the group automatically, but issue a warning that the user can choose to respond to. TLDR: when a user deploys a cluster via a reservation they are responsible for adding the placement group.
Signed-off-by: vsoch vsoch@users.noreply.github.com
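The decision described in this commit message can be sketched as a small Go function. This is only an illustration under stated assumptions: the `NodeGroup` struct, its field names, and `shouldAddPlacementGroup` are hypothetical and are not eksctl's real types or API.

```go
package main

import "fmt"

// NodeGroup is a hypothetical, simplified stand-in for a node group
// configuration. It is not eksctl's real type.
type NodeGroup struct {
	EFAEnabled          bool
	PlacementGroup      string // empty means the user did not set one
	CapacityReservation string // empty means no reservation is used
}

// shouldAddPlacementGroup reports whether a placement group should be added
// automatically, plus a warning for the user when it is deliberately skipped.
func shouldAddPlacementGroup(ng NodeGroup) (bool, string) {
	// Nothing to do without EFA, or when the user already chose a group.
	if !ng.EFAEnabled || ng.PlacementGroup != "" {
		return false, ""
	}
	// With a reservation, leave placement to the user and only warn.
	if ng.CapacityReservation != "" {
		return false, "EFA is enabled with a reservation and no placement group; add one explicitly if you need it"
	}
	// Otherwise keep the default of adding the group automatically.
	return true, ""
}

func main() {
	ng := NodeGroup{EFAEnabled: true, CapacityReservation: "example-reservation"}
	add, warning := shouldAddPlacementGroup(ng)
	fmt.Println(add, warning)
}
```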
vsoch commented on issue flux-framework/flux-coral2#278.
Would it be a man page or a wabbit page?
vsoch pushed to converged-computing/state-machine-operator
add gpu mummi example (working)
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/flux-tutorials
add link to google cloud
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue converged-computing/flux-apps-helm#14.
This is done….
vsoch created a new branch, grow-api-rebase at researchapps/flux-sched
vsoch pushed to flux-framework/fluxion-go
update to use merged flux-sched
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue flux-framework/flux-coral2#276.
badrabbit?
vsoch commented on issue eksctl-io/eksctl#7949.
Are there any developers from AWS that want to talk about a fix? I can implement and open a PR, but I suspect that will go stale too…
vsoch pushed to flux-framework/spack
Bump up the version for rocm-6.3.2 release (#48787)
- Bump up the version for rocm-6.3.2 release
- rocm-openmp-extras update and style correction
- Updating mivisionx, omniperf, rccl & rocprofiler-systems
- Updating hipsparselt & rocm-opencl
- rocprofiler-systems on gcc-13 and rvs commit instead of patch
- Updated rocjpeg & rocm-examples for 6.3.2
- ROCPROFSYS_BUILD_DYNINST & DYNINST_BUILD_TBB are required only with gcc-13
Co-authored-by: afzpatel <122491982+afzpatel@users.noreply.github.com>
vsoch commented on issue flux-framework/flux-sched#1335.
Good call on bidirectional graph! :grapes: …
vsoch pushed to converged-computing/usernetes-azure
pytorch: testing new model to scale on azure (#1)
- pytorch: testing new model to scale on azure
problem: the current resnet model is not scaling/running at a reasonable rate. This start of work will run 8 epochs on 1 or 2 nodes at reasonable times. We should work on it together to do final tweaks for the experiment. See the todo items in the README under the docker/pytorch directory. The container has been built and pushed.
Signed-off-by: vsoch vsoch@users.noreply.github.com
- finishing up testing
Signed-off-by: vsoch vsoch@users.noreply.github.com
Signed-off-by: vsoch vsoch@users.noreply.github.com Co-authored-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue singularityhub/sregistry#449.
It’s a nice idea, but a registry is not an image building service….
vsoch pushed to flux-framework/spack
Automated deployment to update flux-sched versions 2025-02-11 (#297)
Signed-off-by: github-actions github-actions@users.noreply.github.com Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #140 from flux-framework/release-docs-2025-02-11
Update from release-docs-2025-02-11
vsoch pushed to converged-computing/fluxnetes-burst
docs: add note about make sync-external-ip
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue flux-framework/flux-sched#1335.
> I’m looking it over again now, might be worth a performance test on a big graph, just for fun… …
vsoch pushed to converged-computing/flux-apps-helm
Merge pull request #13 from converged-computing/add-kripke
app: add kripke, last one!
vsoch opened a pull request to converged-computing/flux-apps-helm
vsoch created a new branch, grow-api-fix at researchapps/flux-sched
vsoch commented on issue hpc-social/jobs#28.
It should be OK to run locally - the Google sheet that is used for the update is grabbed programmatically via a csv url from Google sheets. What you might want to do is make a copy of the current _data/jobs.yaml in case you want to restore or redo tests. Although the script (if I remember) will generate a previous jobs file in the same directory. The update will write to tmp and then use that file to replace the current jobs, so it will update….
vsoch pushed to flux-framework/fluxion-go
docker: update image bases
Problem: older bases do not have new enough gcc to build flux-sched.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/fluxion
partial cancel
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to researchapps/ramble
feat: update lammps reaxff to use newer builds
Problem: we want to run apptainer with lammps, and not require any spack or previous software installed. This adds an example, and also a new variable that exposes the version of lammps to download.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue pydicom/deid#240.
Thanks for the update!…
vsoch pushed to converged-computing/usernetes-azure
pytorch: testing new model to scale on azure
problem: the current resnet model is not scaling/running at a reasonable rate. This start of work will run 8 epochs on 1 or 2 nodes at reasonable times. We should work on it together to do final tweaks for the experiment. See the todo items in the README under the docker/pytorch directory. The container has been built and pushed.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#55.
We likely could just remove the labels - I don’t think they are meaningfully used. Do you want to open a PR to test this out?…
vsoch pushed to researchapps/ramble
feat: update lammps reaxff to use newer builds
Problem: we want to run apptainer with lammps, and not require any spack or previous software installed. This adds an example, and also a new variable that exposes the version of lammps to download.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/spack
harfbuzz: add v10.2.0 (#48857)
vsoch commented on issue buildsi/ldaudit-yaml#4.
> I find your writing style
vsoch commented on issue GoogleCloudPlatform/ramble#869.
I’m also wondering if we can support a generic application executor. For example, I don’t need to hard code every specific argument for some application, I could just define a container, an input line with parameters, and then use ramble to programmatically (and easily) run a lot of experiments without having to write an application.py for each. It’s a more flexible approach that is less prone to not having, for example, an updated download of data, and it would allow for using ramble to reproduce experiments without needing to write a lot of those complex files. What do you think?…
vsoch pushed to flux-framework/fluxion-go
test: improve testing setup for match and cancel
Problem: the current testing is not standard for Go, and makes it hard to understand or run in units. Solution: move testing into proper test alongside package, and break apart testing for cancel and match. Additionally, we are using the same graphs / jobspecs from flux-sched.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #137 from flux-framework/release-docs-2025-02-05
Update from release-docs-2025-02-05
vsoch pushed to flux-framework/spack
Tesseract v5.5.0 (#48866)
- leptonica: adding v1.85.0 Signed-off-by: Shane Nehring snehring@iastate.edu
- tesseract: adding v5.5.0 Signed-off-by: Shane Nehring snehring@iastate.edu
Signed-off-by: Shane Nehring snehring@iastate.edu
vsoch opened issue GoogleCloudPlatform/ramble#867.
Question: changing an input file
If the input `lammps-stage` is defined as:…
vsoch pushed to singularityhub/shpc-registry
Merge pull request #298 from singularityhub/update/containers-2025-02-03
[bot] update/containers-2025-02-03
vsoch opened issue GoogleCloudPlatform/ramble#863.
Singularity as executable provider
Hi @douglasjacobsen! It looks like Singularity is hiding in ramble as (what I’m guessing is) a spack containerize output? I’m wondering if we might consider having a container provider that knows how to provision a container. E.g., right now I’m going to provide a container binary directly to internals->custom executables, but I wonder if there would be an easy way to define a URI alongside the config, and then have ramble check for Singularity and do the pull of the binary (just to the Singularity cache would be fine) if it doesn’t exist….
vsoch pushed to rseng/software
Merge pull request #408 from rseng/update/software-2025-02-02
Update from update/software-2025-02-02
vsoch pushed to converged-computing/state-machine-operator
testing complete locally
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to converged-computing/state-machine-operator
vsoch closed a pull request to conda-forge/deid-feedstock
vsoch commented on issue pydicom/deid#268.
Thank you for this work @sammaxwellxyz - I have finished up the PR in #273 with a slightly different design to import the pydicom.dcmread in one place that can be edited in the future. Please let me know if you run into any issues, and have a good weekend!…
vsoch opened a pull request to flux-framework/fluxion-go
vsoch commented on issue flux-framework/fluxion-go#16.
OK I went ahead and built with the grow-api branch, and made sure I had a fresh (not used yet) graph, and first I’m testing creating an allocation and then issuing a partial cancel with the entire graph (which I think should work)? …
vsoch pushed to flux-framework/flux-python
downgrade twine to 6.0.1
vsoch pushed to flux-framework/flux-python
downgrade mamba and hope builds 3.8
vsoch pushed to converged-computing/fluxqueue
wip for partial cancel
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to converged-computing/fluxion
vsoch created a new branch, main at compspec/jgf-go
vsoch created a new repository, compspec/jgf-go
vsoch commented on issue flux-framework/flux-sched#1316.
@trws when you are back in business :briefcase: could you review this PR? And no rush if you are still on travel or otherwise - I can continue building from the PR branch….
vsoch pushed to converged-computing/fluxion
add errors endpoint
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to compspec/jobspec-go
feat: add constraints
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch created a new branch, add-constraint at compspec/jobspec-go
vsoch pushed to converged-computing/state-machine-operator
wip: add kubernetes operator (#1)
- wip: add kubernetes operator
- operator is fully working
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to converged-computing/state-machine-operator
vsoch pushed to converged-computing/fluxion
wip: expose satisfy for fluxion
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to singularityhub/shpc-registry
Merge pull request #296 from singularityhub/update/containers-2025-01-27
[bot] update/containers-2025-01-27
vsoch commented on issue pydicom/deid#268.
@sammaxwellxyz are you able and willing to update the PR with those changes?…
vsoch commented on issue pydicom/deid#268.
What you can do is remove or change the pins and then run tests, and fix any issues that arise….
vsoch pushed to converged-computing/state-machine-operator
release: 0.0.0 to coincide with first test demo
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to rseng/software
Merge pull request #407 from rseng/update/software-2025-01-26
Update from update/software-2025-01-26
vsoch pushed to hpc-social/jobs
Revert “Use ISO-8601 date format (#26)” (#27)
This reverts commit 222e52a0bc2bb370d7a0f09310cac3b1a4258b42.
vsoch commented on issue hpc-social/jobs#27.
@hainest you are many things - cat man, super student, good friend, but you are not dumb. You are most definitely not extra dumb!…
vsoch pushed to hainest/jobs
Merge branch ‘main’ into thaines/update_date_handling
vsoch pushed to singularityhub/shpc-registry
Merge pull request #295 from singularityhub/update/containers-2025-01-23
[bot] update/containers-2025-01-23
vsoch closed issue fgmacedo/python-statemachine#518.
Question: example with dynamically defined states?
Hi! I stumbled on your library today, and it’s exactly what I’m looking for to drive a series of jobs in a workflow that will be run in Kubernetes. For most of the examples, it looks like the State and relationships need to be defined in advance (and added as class attributes) and I’m wondering if you could point me to an example where this is done dynamically? E.g., ideally I could either define the class in a function (and return it) or somehow add the objects dynamically to the base StateMachine. I started looking at that class and seeing if I could figure out how to do it, but this seems like it would be a common want so I wanted to ask here first. Thanks!…
vsoch created a new repository, converged-computing/usernetes-azure
vsoch pushed to converged-computing/flux-tutorials
add link to google in readme
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue apptainer/apptainer#2706.
Maybe sanity check that PYTHONPATH / PATH and other envars are exactly the same?…
vsoch pushed to vsoch/citelang
Automated deployment to update contributors 2025-01-20 (#56)
Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to singularityhub/shpc-registry
Merge pull request #294 from singularityhub/update/containers-2025-01-20
[bot] update/containers-2025-01-20
vsoch commented on issue oras-project/oras-py#181.
Since these are docs, I think a good strategy would be to leave the original example, but add the second line with a comment explaining the use case….
vsoch pushed to rseng/software
Merge pull request #406 from rseng/update/software-2025-01-19
Update from update/software-2025-01-19
vsoch pushed to hpc-social/good-first-issues
unpin ruby/setup-ruby action version