vsoch commented on issue kubernetes-sigs/scheduler-plugins#722.
> 120 seems too much as a general default value IMO. Actually, in addition to plugin-level config, it also honors PodGroup-level config, which can be specified in the PodGroup spec, and it takes precedence over the plugin-level one: …
vsoch commented on issue spack/spack#43331.
This is still failing almost all our builds and updates, almost every night, reliably. :cry: …
vsoch pushed to singularityhub/shpc-registry
Merge pull request #220 from singularityhub/update/containers-2024-04-18
[bot] update/containers-2024-04-18
vsoch pushed to researchapps/kueue
Merge pull request #791 from kubernetes-sigs/dependabot/go_modules/github.com/kubeflow/common-0.4.7
Bump github.com/kubeflow/common from 0.4.6 to 0.4.7
vsoch commented on issue kubernetes-sigs/kueue#2001.
hmm the rebase won’t work because it adds my user to all the previous commits. …
vsoch pushed to converged-computing/operator-experiments
experiment 10: complete run of all schedulers on same cluster
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#29.
Ping @johanneskoester, can you review again?…
vsoch pushed to researchapps/wg-image-compatibility
add subject to artifact
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to researchapps/kueue
review: aldo
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch open issue kubernetes-sigs/scheduler-plugins#722.
Rejection due to timeout / unreserve
Hi! I want to make sure I’m not doing anything wrong. I bring up a new cluster on GKE: …
vsoch commented on issue flux-framework/flux-k8s#74.
Failure due to controller entry point change from 5 days ago. Hopefully won…
vsoch pushed to converged-computing/operator-experiments
kueue is working!
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to vsoch/vsoch.github.io
Update 2024-04-15-seeing-yourself.md
vsoch pushed to researchapps/flux-sched
qmanager: Preserve reservations across sched-loop iterations
Problem: Currently we remove all of the temporary reservations created in sched-fluxion-resource to backfill jobs. This has a side effect in that we can’t use that information as start-time estimates for pending jobs.
Preserve those temporary reservations across two consecutive schedule loop iterations.
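The "keep reservations for one extra loop" idea can be illustrated with a two-generation store. flux-sched's qmanager is C++ and its real data structures are not shown in the commit message, so this Go sketch only models the behavior, and every name in it is hypothetical.

```go
package main

import "fmt"

// reservationTracker is a hypothetical illustration: temporary reservations
// made in one scheduling loop survive into the next loop, then are dropped.
type reservationTracker struct {
	current  map[string]int64 // job ID -> estimated start time from this loop
	previous map[string]int64 // reservations carried over from the prior loop
}

func newReservationTracker() *reservationTracker {
	return &reservationTracker{current: map[string]int64{}, previous: map[string]int64{}}
}

// beginLoop rotates the generations: last loop's reservations become
// "previous" and anything older is discarded.
func (t *reservationTracker) beginLoop() {
	t.previous = t.current
	t.current = map[string]int64{}
}

// reserve records a temporary reservation (a start-time estimate) for a job.
func (t *reservationTracker) reserve(jobID string, start int64) {
	t.current[jobID] = start
}

// estimate returns the freshest surviving start-time estimate for a job.
func (t *reservationTracker) estimate(jobID string) (int64, bool) {
	if s, ok := t.current[jobID]; ok {
		return s, true
	}
	s, ok := t.previous[jobID]
	return s, ok
}

func main() {
	tr := newReservationTracker()
	tr.beginLoop()
	tr.reserve("job-a", 100)
	tr.beginLoop()                   // the estimate is still available in this loop
	fmt.Println(tr.estimate("job-a")) // 100 true
	tr.beginLoop()                   // now it has aged out
	fmt.Println(tr.estimate("job-a")) // 0 false
}
```

Rotating two maps keeps the bookkeeping O(1) per loop; a pending job that was not re-reserved simply loses its estimate after the second iteration.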
vsoch commented on issue flux-framework/flux-sched#1178.
Woot!! Ping @trws :green_circle: …
vsoch created a new branch, add-ssl at converged-computing/rainbow
vsoch pushed to singularityhub/shpc-registry
Automated deployment to update containers 2024-04-15 (#219)
Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to researchapps/wg-image-compatibility
add use cases so brandon is happy
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue opencontainers/wg-image-compatibility#15.
To be clear for this comment: https://github.com/opencontainers/wg-image-compatibility/pull/15#discussion_r1555142734 …
vsoch commented on issue flux-framework/flux-python#11.
Sure thing, thanks for the notice! The flux versions are moving very quickly these days. Going to close the issue - please re-open or comment if something else comes up….
vsoch pushed to flux-framework/Tutorials
Merge pull request #36 from flux-framework/fix-flux-tree
hotfix: flux-tree stat -> stats
vsoch pushed to converged-computing/ensemble-operator
test: adding testing setup (#13)
- test: adding testing setup
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch merged a pull request to converged-computing/ensemble-experiments
vsoch pushed to rseng/software
Merge pull request #370 from rseng/update/software-2024-04-14
Update from update/software-2024-04-14
vsoch pushed to flux-framework/Tutorials
Removes old notebook (#35)
## What’s Changed
- gke: allow customization of autoscaling strategy by @vsoch in https://github.com/converged-computing/kubescaler/pull/21
Full Changelog: https://github.com/converged-computing/kubescaler/compare/0.0.19…0.0.2
vsoch pushed to converged-computing/kubescaler
add support for node scale request
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/ensemble-operator
wip: testing different orderings (#12)
- wip: testing different orderings
Problem: randomize is limited for ordering. Solution: change the randomize “boolean” into an order variable that can take several forms. We will want to test this to see how the order of jobs can impact an ensemble with autoscaling.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/ensemble-operator
preparing to run experiments
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/ensemble-experiments
Fix title of run3 plot
vsoch created a new branch, main at converged-computing/ensemble-experiments
vsoch opened a pull request to spack/spack
vsoch pushed to researchapps/flux-sched
test: init flywheel in tests and try fwinit
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/spack
Automated deployment to update flux-core versions 2024-04-13 (#176)
Signed-off-by: github-actions github-actions@users.noreply.github.com Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch commented on issue flux-framework/flux-sched#1169.
There is also something called a key value flyweight, I wonder if we need to use that for some of the subsystem maps? https://www.boost.org/doc/libs/1_79_0/libs/flyweight/example/key_value.cpp. I also don’t know the difference between when they show: …
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #113 from flux-framework/release-docs-2024-04-13
Update from release-docs-2024-04-13
vsoch pushed to flux-framework/Tutorials
Cleanup and improve text for RIKEN tutorial (#34)
- Fixes some URLs
- Updates notebook “goals”
- Updates description of job throughput fig and adds proper figure numbering/captions
- Adds proper figure numbers and captions to module 3
- Adds fig numbers, captions, and footnotes to every figure in tutorial
- Updates refs to figures in text to correspond to figure numbering
- Updates verb tenses, fixes a DYAD figure, and adds a link to survey
vsoch pushed to flux-framework/Tutorials
Merge pull request #32 from TauferLab/riken-dyad-hotfix
Hotfix for a bug in DyadTorchDataLoader
vsoch pushed to researchapps/flux-sched
fix: convert strings to boost::flyweight
Problem: string comparison is immensely inefficient, taking 6-10% of resources (reported by Tom) for a trace. Solution: start with the root of the issue in the scoring module and work backwards to convert string types to boost::flyweight. I tried to have a minimal footprint here, but it spread really quickly.
Signed-off-by: vsoch vsoch@users.noreply.github.com
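The idea behind the flyweight change (store each distinct string once and compare cheap handles instead of bytes) can be shown with a small interner. This is a Go analogue for illustration, not boost::flyweight or the flux-sched patch; `Interner` and its methods are hypothetical names.

```go
package main

import "fmt"

// Interner maps each distinct string to a small integer ID, so later
// comparisons are integer compares rather than byte-by-byte string compares.
type Interner struct {
	ids   map[string]uint32
	names []string
}

func NewInterner() *Interner { return &Interner{ids: map[string]uint32{}} }

// Intern returns the ID for s, assigning a new one the first time s is seen.
func (in *Interner) Intern(s string) uint32 {
	if id, ok := in.ids[s]; ok {
		return id
	}
	id := uint32(len(in.names))
	in.ids[s] = id
	in.names = append(in.names, s)
	return id
}

// Name recovers the original string for an ID.
func (in *Interner) Name(id uint32) string { return in.names[id] }

func main() {
	in := NewInterner()
	a := in.Intern("containment")
	b := in.Intern("containment")
	c := in.Intern("network")
	fmt.Println(a == b, a == c, in.Name(c)) // true false network
}
```

The cost moves to a single hash lookup at intern time; every comparison after that, for example in a scoring loop, is a constant-time integer check.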
vsoch created a new branch, add-boost-flyweight at researchapps/flux-sched
vsoch opened a pull request to flux-framework/flux-sched
vsoch commented on issue flux-framework/flux-sched#1169.
Some tiny progress! Thanks to @milroy for seeing this. Here is the first failure to build (this is for the focal build) …
vsoch pushed to converged-computing/rainbow
Merge pull request #30 from converged-computing/add/devcontainer
dev: add vscode development container and docs for usage
vsoch pushed to converged-computing/kubescaler
formatting
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to snakemake/snakemake-executor-plugin-googlebatch
fix: cos credentials only for settings
Problem: the COS container and credentials should be exposed in the executor settings. Solution: add them there.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to singularityhub/shpc-registry
Automated deployment to update containers 2024-04-11 (#218)
Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to converged-computing/fluxterm
typo: ridiculous in README
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to converged-computing/fluxterm
fix: release workflow
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to spack/spack
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#29.
huh, but if it works for you that’s great! Let’s get @johanneskoester to try it out for another test….
vsoch pushed to rseng/rsepedia-analysis
Merge pull request #116 from rseng/update/analysis-2024-04-10
Update from update/analysis-2024-04-10
vsoch pushed to flux-framework/spack
Automated deployment to update flux-sched versions 2024-04-10 (#174)
Signed-off-by: github-actions github-actions@users.noreply.github.com Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to converged-computing/ensemble-operator
adding experiment
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#46.
Also if you are new to rebasing, I made a very dumb video a few years ago, haha. https://youtu.be/9F4RE2_yn6I …
vsoch pushed to sciworks/spack-updater
fix: add gpg init to spack
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/spack
test: buildcache for libsodium (#170)
- test: buildcache for libsodium
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch open issue flux-framework/flux-k8s#73.
Test using Active queue "Activate Siblings" vs Current approach
Coscheduling uses a strategy of moving siblings to the active queue when a pod that is about to land on a node reaches the Permit extension point. The strategy I have in place to schedule the first pod seems to be working OK, but I’d like to (after we merge the current PR) test this approach. I can see pros and cons to both ways: having to rely on another queue (subject to other issues) seems less ideal than having them all scheduled at the right time. On the other hand, if something could happen with the current approach that warrants the active queue, maybe it makes sense. I think empirical testing can help us determine which strategy we like best (or even a combination of the two).…
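To make the "activate siblings" side of the comparison concrete, here is a toy model of the move it performs. It is deliberately not the kube-scheduler queue or the coscheduling plugin: `schedQueues` and `activateSiblings` are hypothetical names, and real queues track far more state (backoff, unschedulable reasons, events).

```go
package main

import "fmt"

// schedQueues is a toy model, not the kube-scheduler's real queue: an active
// queue plus a holding area of pods waiting to be retried.
type schedQueues struct {
	active  []string
	waiting map[string]string // pod name -> pod group name
}

// activateSiblings moves every waiting pod in the given group to the active
// queue, roughly what "activate siblings" does when one member reaches Permit.
func (q *schedQueues) activateSiblings(group string) int {
	moved := 0
	for pod, g := range q.waiting {
		if g == group {
			q.active = append(q.active, pod)
			delete(q.waiting, pod) // deleting during range is safe in Go
			moved++
		}
	}
	return moved
}

func main() {
	q := &schedQueues{waiting: map[string]string{
		"lammps-1": "lammps", "lammps-2": "lammps", "amg-0": "amg",
	}}
	// Suppose lammps-0 just reached Permit: pull its siblings forward.
	moved := q.activateSiblings("lammps")
	fmt.Println(moved, len(q.active), len(q.waiting)) // 2 2 1
}
```

The trade-off from the issue shows up even here: siblings only move once one member reaches Permit, so group latency still depends on how quickly the active queue drains.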
vsoch open issue flux-framework/flux-k8s#71.
Update fluence to go 1.20 or 1.21
We are going to hit issues using fluence (go 1.19) with other integrations like rainbow (go 1.20) and on our systems (go 1.20), and after #69 we should consider updating. …
vsoch commented on issue flux-framework/flux-core#5862.
> Additionally, if there are any other tips or workarounds for building and testing Flux within a Singularity container without sudo privileges, I’d like to hear them. …
vsoch merged a pull request to singularityhub/shpc-registry
vsoch pushed to converged-computing/rainbow
docs: separate early from current designs
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to rseng/software
Merge pull request #369 from rseng/update/software-2024-04-07
Update from update/software-2024-04-07
vsoch opened a pull request to converged-computing/rainbow
vsoch pushed to snakemake/snakemake-executor-plugin-googlebatch
fix: make get_snakefile return rel path to snakefile (#40) (#41)
Should fix #39. Previously #40
Co-authored-by: Cade Mirchandani cmirchan@ucsc.edu
vsoch commented on issue snakemake/snakemake-executor-plugin-googlebatch#29.
That…
vsoch pushed to converged-computing/rainbow
Merge pull request #27 from converged-computing/add-state-endpoint
feat: state endpoint
vsoch created a new branch, add-constraint-select-algorithm at converged-computing/rainbow
vsoch pushed to compspec/jobspec-go
fix: bug with omitemtpy->omitempty
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to compspec/jobspec
Merge pull request #10 from compspec/allow-parameter-properties
fix: nest parameter properties in their own section
vsoch pushed to flux-framework/flux-k8s
test: only allow scheduling first pod
Problem: we currently allow any pod in the group to make the request. Solution: Making a BIG assumption that might be wrong, I am adding logic that only allows scheduling (meaning going through PreFilter with AskFlux) given that we see the first pod in the listing. In practice this is the first index (e.g., index 0), which based on our sorting strategy (timestamp then name) I think might work. But I am not 100% sure of that. The reason we want to do that is so the nodes are chosen for the first pod, and then the group can quickly follow and be actually assigned. Before I did this I kept seeing huge delays in waiting for the queue to move (e.g., 5/6 pods Running and the last one waiting, and then kicking in much later like an old car), and I think with this tweak that is fixed. But this is my subjective evaluation. I am also adding in the hack script for deploying to GKE, which requires a push instead of a kind load.
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch opened a pull request to converged-computing/ensemble-operator
vsoch pushed to rseng/devstories
tweak episode notes
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue oras-project/oras-py#129.
@my5cents looks like you just need one more run of black and we’re good (take note of the version)….
vsoch pushed to flux-framework/flux-operator
update gozmq example (#225)
- update gozmq example
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to flux-framework/flux-docs
Merge pull request #267 from flux-framework/add-architecture-slides
Add flux components slides
vsoch commented on issue flux-framework/flux-docs#267.
Thanks @garlick ! I’ll get started on these changes and ping you when they are ready for a second review. I really appreciate it!…
vsoch commented on issue sustainable-computing-io/peaks#9.
Perfect, thank you! Is there a link to that somewhere (prominently) here?…
vsoch opened a pull request to spack/spack
vsoch pushed to rseng/rsepedia-analysis
Merge pull request #115 from rseng/update/analysis-2024-04-03
Update from update/analysis-2024-04-03
vsoch pushed to flux-framework/spack
Merge pull request #166 from flux-framework/release/flux-core-v0.61.0
Update from release/flux-core-v0.61.0
vsoch pushed to flux-framework/flux-framework.github.io
Merge pull request #111 from flux-framework/release-docs-2024-04-03
Update from release-docs-2024-04-03
vsoch pushed to converged-computing/rainbow-experiments
add linear model with memory features
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch commented on issue sustainable-computing-io/peaks#9.
Hi! Is work still underway here? I am interested in the project idea but I don’t see any custom scheduler plugin code (is it somewhere else)? Thanks! …
vsoch commented on issue spack/spack-infrastructure#795.
Thank you!…
vsoch pushed to opencontainers/specs.opencontainers.org
Merge pull request #6 from Nicceboy/main
README: fix website url
vsoch pushed to converged-computing/rainbow-experiments
add linear and log linear regression for spack builds
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch open issue converged-computing/rainbow#25.
The gnomes have things to say
I hate to disappoint, there are no cookies here. :cookie: :cookie: …
vsoch pushed to converged-computing/operator-experiments
fluence: first successful set of runs without clogging
Signed-off-by: vsoch vsoch@users.noreply.github.com
vsoch pushed to singularityhub/shpc-registry
Automated deployment to update containers 2024-04-01 (#215)
Co-authored-by: github-actions github-actions@users.noreply.github.com
vsoch pushed to converged-computing/rainbow-experiments
Add reliabuild version range experiments (#1)
- wip: running reliabuild experiments
Signed-off-by: vsoch vsoch@users.noreply.github.com
- refactor: match based on versions
Signed-off-by: vsoch vsoch@users.noreply.github.com
- wip: refactored reliabuild experiments
This experiment has become primarily about the build use case, and specifically package versions as the compatibility metadata. Since we cannot generate every perfect cluster, I think the result is going to be an example of how adding too much (in terms of requirements) leads to poorer outcomes. Of course, this depends entirely on the number of clusters I made, etc. I currently have only 100 clusters for about 10K jobs and over 200 dependency metadata fields (not every one is relevant for every package), so the odds of getting a matching cluster are pretty slim when you start asking for more detail. Hopefully folks can help to think of smarter experiments too.
Signed-off-by: vsoch vsoch@users.noreply.github.com
- results: add raw files in case I do something stupid
Signed-off-by: vsoch vsoch@users.noreply.github.com
- add results
Signed-off-by: vsoch vsoch@users.noreply.github.com
Signed-off-by: vsoch vsoch@users.noreply.github.com Co-authored-by: vsoch vsoch@users.noreply.github.com
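Why "adding too much leads to poorer outcomes" follows from version matching can be seen in a small sketch: every added requirement is one more condition a cluster must satisfy, so the match count can only stay the same or shrink. The data model here (a package-to-version map per cluster, inclusive min/max ranges, numeric dotted versions) is an assumption for illustration, not the experiment's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted numeric versions such as "1.10.2";
// in this simplified sketch, non-numeric parts parse as 0.
func compareVersions(a, b string) int {
	pa, pb := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(pa) || i < len(pb); i++ {
		x, y := 0, 0
		if i < len(pa) {
			x, _ = strconv.Atoi(pa[i])
		}
		if i < len(pb) {
			y, _ = strconv.Atoi(pb[i])
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// Range is a hypothetical inclusive version requirement; empty bounds are open.
type Range struct{ Min, Max string }

func (r Range) Contains(v string) bool {
	if r.Min != "" && compareVersions(v, r.Min) < 0 {
		return false
	}
	if r.Max != "" && compareVersions(v, r.Max) > 0 {
		return false
	}
	return true
}

// matches reports whether a cluster (package -> installed version) satisfies
// every requirement; a missing package fails the match.
func matches(cluster map[string]string, reqs map[string]Range) bool {
	for pkg, r := range reqs {
		v, ok := cluster[pkg]
		if !ok || !r.Contains(v) {
			return false
		}
	}
	return true
}

// countMatches counts how many clusters satisfy all requirements.
func countMatches(clusters []map[string]string, reqs map[string]Range) int {
	n := 0
	for _, c := range clusters {
		if matches(c, reqs) {
			n++
		}
	}
	return n
}

func main() {
	clusters := []map[string]string{
		{"zlib": "1.2.13", "cmake": "3.26.0"},
		{"zlib": "1.3", "cmake": "3.20.1"},
		{"zlib": "1.2.11"},
	}
	loose := map[string]Range{"zlib": {Min: "1.2.11"}}
	strict := map[string]Range{"zlib": {Min: "1.2.11"}, "cmake": {Min: "3.24"}}
	fmt.Println(countMatches(clusters, loose), countMatches(clusters, strict)) // 3 1
}
```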
vsoch pushed to rseng/software
Merge pull request #368 from rseng/update/software-2024-03-31
Update from update/software-2024-03-31
vsoch opened a pull request to converged-computing/rainbow-experiments
vsoch open issue converged-computing/rainbow#24.
Add robust logging library to control verbosity
As my experiments are getting very large, I am commenting out debugging messages so the terminal doesn’t explode. It would be better to use a logging library proper.…
vsoch commented on issue singularityhub/singularity-cli#220.
I think it…