vsoch created a new tag, 0.2.28 at oras-project/oras-py
vsoch opened a pull request to converged-computing/state-machine-operator
vsoch pushed to conda-forge/oras-py-feedstock
updated v0.2.28 (#33)
vsoch pushed to sciworks/spack-updater
compilers are now packages
vsoch pushed to rseng/software
Merge pull request #416 from rseng/update/software-2025-03-30
Update from update/software-2025-03-30
vsoch pushed to converged-computing/performance-study
Merge pull request #85 from converged-computing/azure-osu-reruns
osu re-runs - not a success
vsoch pushed to converged-computing/state-machine-operator
wip: add support for workflow events (#27)
- wip: add support for workflow events
This adds support for ending the workflow early based on a count of successes, a count of failures, or a job duration metric. Next we need to add the ability to grow or shrink (which needs some thought, since we want a cloud-agnostic solution) and then decide how to handle application-specific metrics.
Signed-off-by: vsoch vsoch@users.noreply.github.com
- feat: add support for minicluster
If we really want to test scale (shrink and grow) of a job and have it work with the cluster autoscaler, plus collecting metrics from an HPC app, we can most easily do that with the flux operator. This feature adds support for specifying a minicluster property to convert the previous indexed job into a MiniCluster. The flux operator needs to be installed.
Signed-off-by: vsoch vsoch@users.noreply.github.com
- feat: shrink with flux minicluster example working.
Signed-off-by: vsoch vsoch@users.noreply.github.com
- save state
Signed-off-by: vsoch vsoch@users.noreply.github.com
- feat: support for custom metrics
In this example, the user can provide a custom script that is run against the log and must return a dictionary of values (the custom metrics). These are passed back to the manager from the state machine step and can influence workflow behavior (e.g., stop early, grow, or shrink).
Signed-off-by: vsoch vsoch@users.noreply.github.com
Signed-off-by: vsoch vsoch@users.noreply.github.com Co-authored-by: vsoch vsoch@users.noreply.github.com
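A minimal sketch of the custom-metrics flow described in the commit above, assuming the job log contains lines of the form `METRIC name=value`. The function names `parse_metrics` and `next_action`, the `gpu_utilization` metric, and every threshold are hypothetical illustrations and are not taken from state-machine-operator.

```python
import re

# Assumed log convention (hypothetical): one "METRIC name=value" per line.
METRIC_LINE = re.compile(r"^METRIC (\w+)=([-+]?\d+(?:\.\d+)?)$", re.MULTILINE)


def parse_metrics(log: str) -> dict:
    """Stand-in for the user's custom script: return a dict of metrics from a log."""
    return {name: float(value) for name, value in METRIC_LINE.findall(log)}


def next_action(metrics, succeeded, failed, max_failures=3, target_successes=10):
    """Hypothetical manager rule that maps counts and metrics to a workflow action."""
    # End the workflow early on too many failures or enough successes.
    if failed >= max_failures or succeeded >= target_successes:
        return "stop"
    # Shrink when a (made-up) utilization metric reports mostly idle resources.
    if metrics.get("gpu_utilization", 100.0) < 25.0:
        return "shrink"
    return "continue"
```

For example, `next_action(parse_metrics(log), succeeded=2, failed=0)` would return `"shrink"` for a log that reports `METRIC gpu_utilization=12`.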
vsoch pushed to singularityhub/shpc-registry
Merge pull request #305 from singularityhub/update/containers-2025-03-27
[bot] update/containers-2025-03-27
vsoch pushed to converged-computing/state-machine-operator
feat: shrink with flux minicluster example working.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue oras-project/oras-py#185.
I’d be happy to review a PR that adds this functionality then….
vsoch pushed to converged-computing/state-machine-operator
feat: analysis and plotting functions (#26)
- feat: analysis and plotting functions
- ensure x axis is same scale
- add analysis libfuncs
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue vsoch/oci-python#23.
Thanks! I remember this bit me for other projects, I appreciate the catch here….
vsoch pushed to converged-computing/state-machine-operator
feat: allow multiple node jobs
There is a bug in the Kubernetes tracker: we treat failed/succeeded as a boolean (0/1) when it is actually a count of indices. We have not run experiments with more than one node, so this had not been an issue (or been caught). This change fixes it.
Signed-off-by: vsoch vsoch@users.noreply.github.com
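The boolean-versus-count bug described above can be illustrated with a small sketch. It assumes the job status is available as a dictionary with optional integer `succeeded` and `failed` fields, and that any failure marks the job as failed (a simplification that ignores retries). The function name `job_state` and its return values are hypothetical and are not the tracker's actual code.

```python
def job_state(status: dict, completions: int) -> str:
    """Classify a job from its status counts (hypothetical helper)."""
    # Missing or None fields are treated as zero.
    succeeded = status.get("succeeded") or 0
    failed = status.get("failed") or 0

    # A check like `succeeded == 1` only works for single-node jobs.
    # With counts, compare against the number of expected completions.
    if failed > 0:
        return "failed"
    if succeeded >= completions:
        return "succeeded"
    return "running"
```

With four nodes, `job_state({"succeeded": 1}, completions=4)` is still `"running"`, whereas a boolean check would have reported the job as done.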
vsoch pushed to converged-computing/performance-study
Merge pull request #84 from converged-computing/redo-osu
osu: fix runs for gpu 128 GKE and CE
vsoch commented on issue oras-project/oras-py#185.
Is this supported for the oras client in Go?…
vsoch opened a pull request to spack/spack
vsoch commented on issue skypilot-org/skypilot#3777.
Closing for no interest….
vsoch pushed to rseng/software
Merge pull request #415 from rseng/update/software-2025-03-23
Update from update/software-2025-03-23
vsoch pushed to flux-framework/spack
bug: cffi needs to be present for link (configure) (#308)
Signed-off-by: vsoch vsoch@users.noreply.github.com Co-authored-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue oras-project/oras-py#164.
Are we good to close here?…
vsoch pushed to converged-computing/state-machine-operator
feat: save kubernetes logs.
We have been saving artifacts for everything, relying on the application to take the burden of saving its own logging retrieved from the registry. For experiments with gpu selection we just need one little value, and I think it would be easier to save all the logs instead of using oras. This feature supports that, where the user adds a properties -> save-path, and under that path “logs” is created that is named by the job, step, and pod index.
Signed-off-by: vsoch vsoch@users.noreply.github.com
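One way the log layout described above could look, as a rough sketch: it assumes a `logs` directory under the configured save path and one file per job, step, and pod index. The exact file naming, the `.log` suffix, and the helper names `log_path` and `save_log` are assumptions for illustration only.

```python
from pathlib import Path


def log_path(save_path, job: str, step: str, pod_index: int) -> Path:
    """Build <save_path>/logs/<job>-<step>-<pod_index>.log (assumed naming)."""
    return Path(save_path) / "logs" / f"{job}-{step}-{pod_index}.log"


def save_log(save_path, job: str, step: str, pod_index: int, text: str) -> Path:
    """Write the log text to its file, creating the logs directory if needed."""
    path = log_path(save_path, job, step, pod_index)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path
```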
vsoch pushed to conda-forge/oras-py-feedstock
oras-py v0.2.27 (#32)
- updated v0.2.27
- MNT: Re-rendered with conda-build 25.1.2, conda-smithy 3.47.0, and conda-forge-pinning 2025.03.21.21.56.39
vsoch pushed to singularityhub/shpc-registry
Merge pull request #304 from singularityhub/update/containers-2025-03-20
[bot] update/containers-2025-03-20
vsoch pushed to flux-framework/spack
re-enable flux checks
vsoch pushed to converged-computing/state-machine-operator
bug: flux failed jobs do not have status COMPLETED, they are FAILED
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue pydicom/deid#275.
Closed with #276 …
vsoch commented on issue flux-framework/flux-core#6713.
I couldn’t say now - I wound up sending a kill signal to the job, and didn’t save the data because I considered the run erroneous!…
vsoch pushed to conda-forge/deid-feedstock
Merge pull request #48 from regro-cf-autotick-bot/0.4.1_h47750b
deid v0.4.1
vsoch commented on issue pydicom/deid#276.
I can see the output above and the logic in the code, so no need. I think this is good to go - if you could please bump the version in version.py and add a corresponding note in the CHANGELOG.md we should be good….
vsoch pushed to converged-computing/state-machine-operator
feat: add more resource specs to flux tracker job submit
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to spack/spack
vsoch merged a pull request to singularityhub/shpc-registry
vsoch commented on issue pydicom/deid#274.
Yes! Generally speaking you’d want to have a regular expression that matches that case here: https://github.com/pydicom/deid/blob/14d1e4eb70f2c9fda43fca411794be9d8a5a8516/deid/utils/actions.py#L32 and then throw an error when the particular name for the function is missing, or if the name is not found in “item.” I’d be happy to review a PR for that, and a test could go here….
vsoch pushed to flux-framework/spack
Update from update-package/flux-sched-2025-03-12 (#306)
- Automated deployment to update package flux-sched 2025-03-12
- Add 0.43 back
Co-authored-by: github-actions github-actions@users.noreply.github.com Co-authored-by: Vanessasaurus <814322+vsoch@users.noreply.github.com>
vsoch pushed to converged-computing/state-machine-operator
make more resilient to error
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/state-machine-operator
add support for oras arch for arm, etc. (#19)
- add support for oras arch for arm, etc.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch merged a pull request to converged-computing/state-machine-operator
vsoch pushed to converged-computing/state-machine-operator
values are always strings
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch reviewed a snakemake/snakemake-storage-plugin-gcs pull request
Looks good to me, as long as @johanneskoester gives a :+1: as well….
vsoch closed issue flux-framework/flux-python#9.
Potential issue with Flux 0.58 or the 0.57 python bindings installed with pip
Ran into some import issues with the latest flux 0.58 install (the public systems that are flux native at llnl), and the 0.57 python bindings from pypi: it seems the python bindings install isn’t quite finding the right things? Importing flux fails with a missing function in the c-python bindings (stack trace below, but doesn’t seem like that function’s a particular problem, just the first to get hit): …
vsoch commented on issue flux-framework/flux-python#13.
Please test / compare with the system flux, and look for differences. First, if the system level flux import doesn’t work, the issue is there. If there is a difference in the install structure, then we likely need an update to the logic. Let me know what you find….
vsoch pushed to converged-computing/state-machine-operator
nit: rename cores per task to cores_per_task
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch merged a pull request to converged-computing/state-machine-operator
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#57.
Sure, have fun!…
vsoch pushed to rseng/software
Merge pull request #413 from rseng/update/software-2025-03-09
Update from update/software-2025-03-09
vsoch pushed to converged-computing/state-machine-operator
final tweak
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/spack
Fix missing hipBlas symbol (#49298)
Co-authored-by: Eric B. Chin chin23@llnl.gov Co-authored-by: Greg Becker becker33@llnl.gov
vsoch pushed to converged-computing/state-machine-operator
bug: additional active jobs added
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/mummi-experiments
notes from meeting and workflow updates
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to singularityhub/shpc-registry
Merge pull request #302 from singularityhub/update/containers-2025-03-06
[bot] update/containers-2025-03-06
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #144 from flux-framework/release-docs-2025-03-06
Update from release-docs-2025-03-06
vsoch opened a pull request to converged-computing/state-machine-operator
vsoch opened a pull request to spack/spack
vsoch commented on issue jbms/sphinx-immaterial#412.
Excellent, thank you!…
vsoch commented on issue jbms/sphinx-immaterial#412.
Thank you!…
vsoch pushed to flux-framework/spack
QtPackage: modify QT_ADDITIONAL_PACKAGES_PREFIX_PATH handling (#49297)
- QtPackage: mv QT_ADDITIONAL_PACKAGES_PREFIX_PATH handling
- geomodel: support Qt6
- qt-base: rm import re
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #143 from flux-framework/release-docs-2025-03-05
Update from release-docs-2025-03-05
vsoch commented on issue rootless-containers/usernetes#368.
Confirmed just now that increasing the uid range in those files fixes all the issues. Are there other options to that? I don’t think we could do that on a production system….
vsoch opened issue jbms/sphinx-immaterial#412.
sphinx_immaterial.nav_adapt.MkdocsNavEntry object' has no attribute 'parent'
I haven’t built my docs for a while, and a new error has popped up! Does it look familiar?…
vsoch commented on issue huggingface/gpu-fryer#3.
Gotcha, thanks for the speedy response! …
vsoch merged a pull request to flux-framework/flux-operator
vsoch pushed to converged-computing/aws-performance-study
add gpu-fryer note - only intended for single nodes
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue containerd/nerdctl#2020.
+1, this would be useful for me as well. I’m having a hard time with podman on our HPC clusters, and the underlying issue is the uid mappings (given how many there are in the rootless kind container, there would be no reasonable way to give those kinds of ranges to each user). I think I could map specific ones to the user on the system, but I need this exposed. I was trying nerdctl and it failed, unable to extract layers because of permissions….
vsoch commented on issue cloudmercato/ai-benchmark#2.
@Oil3 did you ever test this on more than one GPU or node? I ran the benchmark today on one node, one GPU and only one test didn’t run (a version of keras too new) and I’m wondering if it can extend beyond that. From a quick glance it seems like maybe it would work on >1 GPU but not more than one node?…
vsoch pushed to flux-framework/Tutorials
Merge pull request #47 from flux-framework/hpcic-2024-tutorial-slides
hpcic 2024: adding tutorial slides
vsoch pushed to converged-computing/flux-tutorials
Add notebook tutorial (#9)
- add notebook tutorial and ci
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/aws-performance-study
add gpu burn
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to vsoch/ai-benchmark
readme: tensorflow dependency update
The tensorflow-gpu library is deprecated (and no longer pip installable).
vsoch pushed to rseng/software
Merge pull request #412 from rseng/update/software-2025-03-02
Update from update/software-2025-03-02
vsoch pushed to flux-framework/spack
py-pymc3: not compatible with numpy 2 (#49225)
vsoch pushed to converged-computing/google-performance-study
add initial test mnist data here
This is being removed from flux-usernetes and I do not want to lose it
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/flux-usernetes
Merge pull request #23 from converged-computing/google-cloud-gpu-experiment
experiment: gke/usernetes on compute engine v100 1:1 gpu:node
vsoch opened a pull request to converged-computing/flux-apps-helm
vsoch commented on issue cloudmercato/ai-benchmark#2.
It looks like the code no longer has this (although the pip install does) but maybe this would work?…
vsoch pushed to converged-computing/flux-usernetes
gke: sizes 4,8,16,32 pytorch mnist
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to spack/spack
vsoch commented on issue flux-framework/spack#303.
This will be closed by https://github.com/spack/spack/pull/49230…
vsoch pushed to flux-framework/flux-go
Merge pull request #4 from flux-framework/add-python-grpc
feat: add python grpc service
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #142 from flux-framework/release-docs-2025-02-28
Update from release-docs-2025-02-28
vsoch opened issue converged-computing/state-machine-operator#6.
A few TODO for state machine operator
These are from my personal notes - not high priority so putting them here….
vsoch commented on issue singularityhub/shpc-registry#301.
I think we would want to make sure that the path is derived as simply the digest with sif. If you want to open a PR to work on it I’d be happy to review….
vsoch commented on issue spack/spack#49197.
I’m OK not being a maintainer here, but of course if you run into issues (I’m guessing this is for Mummi?) please come to me first!…
vsoch closed a pull request to flux-framework/spack
vsoch pushed to flux-framework/spack
py-sympy: add v1.13.1 (#48951)
- py-sympy: add v1.13.1
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#56.
Is it possible something cached your directory state that needs to be reset / cleaned?…
vsoch commented on issue rootless-containers/usernetes#366.
oh wow, this is really interesting!…
vsoch pushed to researchapps/usernetes
ci: add test for rootful docker
This is important to run on multi-node
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue NeuroVault/neurovault_collection_downloader#3.
> I believe https://github.com/NeuroVault/pynv was the intended replacement. …
vsoch pushed to rseng/software
Merge pull request #411 from rseng/update/software-2025-02-23
Update from update/software-2025-02-23
vsoch commented on issue converged-computing/performance-study#78.
This likely won’t be merged, but I’ll add the results (from when I ran them) for transparency. This thread is from December 15th 2024. …
vsoch commented on issue NVIDIA/nvidia-container-toolkit#56.
Figured it out….
vsoch pushed to vsoch/vsoch.github.io
rename AKS to Azure Kubernetes Service
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue rootless-containers/usernetes#242.
I got this fully working in rootless mode - I’ll put together a quick writeup soon….
vsoch pushed to converged-computing/flux-usernetes
gpu pytorch add dockerfile
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue rootless-containers/usernetes#242.
I made some progress in the 11 hours since I posted, and I think now the issue is in the space of nvidia. What is almost working is to set no-cgroups = true in the nvidia container runtime config.toml, but then there are issues with containerd on the kubelet. I posted more here: https://github.com/NVIDIA/nvidia-container-toolkit/issues/56#issuecomment-2673830806…
vsoch pushed to researchapps/usernetes
ci: add test for rootful docker
This is important to run on multi-node
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to hpc-social/jobs
run jobs updater once a day
vsoch commented on issue NVIDIA/nvidia-container-toolkit#85.
@elezar is CDI supported for Docker 28.0.0 now? I am having this specific issue (where I can’t use no-cgroups) and would like to test CDI - my setup is using rootless docker and docker compose….