Scientific workflow management system


30

Can anyone recommend a good workflow management system (WMS) in Python? So far I have been using GNU Make, but it introduces a layer of complexity that I would like to avoid. A good WMS should have the following features:

  • integrate easily with command-line tools and Python scripts,
  • be easy to use and lightweight,
  • handle dependencies,
  • provide a command-line interface,
  • provide a logging mechanism,
  • (optional) provide data provenance.

I know that WMSs are very popular in bioinformatics (for example Galaxy), but I am looking for something more general.


2
This is not a complete answer, but since you mentioned GNU Make and Python in the same question, I thought I would point you to SCons: scons.org
Red. Atcheson

Thanks. Do you know of any examples that use SCons for scientific workflows?
btel

I have found that with a little work you can get Emacs to do most things (sometimes through integration with external tools). This is probably not what you are looking for, though, since I find that I still generally have to use makefiles to compile anything nontrivial.
Dan

1
I could write an answer about SCons and waf, which are Python-based build system tools. I have used SCons for a few months now and could give some perspective on its good and bad points relative to GNU Make. That said, I was wondering if you could explain what you mean by "provide a logging mechanism" and "provide data provenance". For logging, do you just want a logfile, or are you looking for something more like a version control system?
Geoff Oxberry

1
Logging might be something very simple like logfiles, as you suggest, with timestamps of all runs and redirection of stderr and (optionally) stdout. In addition, one could keep the intermediate results from each step in the workflow in a separate directory. Data provenance is something more like a version control system that keeps the history of all computation scripts, input files, and output files. Currently I use Makefiles + git, but I am looking for something better integrated and easier to use. I have heard of SCons, but I do not know what its advantage over Make is.
btel
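
The simple form of logging described in the comment above (timestamped runs with stderr and stdout redirected to files) could be sketched with the Python standard library roughly as follows. This is only an illustration, not part of any tool mentioned in this thread; the function name `run_step`, the `log` directory, and the `runs.log` summary file are assumptions made for the example.

```python
import datetime
import pathlib
import subprocess
import sys


def run_step(name, command, log_dir="log"):
    """Run one workflow step, capturing stdout and stderr in timestamped files."""
    log_path = pathlib.Path(log_dir)
    log_path.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    out_file = log_path / f"{stamp}_{name}.stdout"
    err_file = log_path / f"{stamp}_{name}.stderr"
    # Redirect both output streams of the child process to their own files.
    with open(out_file, "w") as out, open(err_file, "w") as err:
        result = subprocess.run(command, stdout=out, stderr=err)
    # Append a one-line summary of this run to a shared log (hypothetical layout).
    with open(log_path / "runs.log", "a") as summary:
        summary.write(f"{stamp} {name} exit={result.returncode} cmd={command!r}\n")
    return result.returncode, out_file, err_file
```

A step would then be started with something like `run_step("hello", [sys.executable, "-c", "print('hello')"])`. Keeping intermediate results per step would need one more directory per run, which this sketch leaves out.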

Answers:


12

For logging that allows full reproducibility, I highly recommend the Sumatra Python package. It nicely links the version control commit number, machine state, and output files to each program run, and it has a Django web interface for interacting with the database of run info. The Python API makes it very easy to include logging in my scripts.
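
To make concrete what kind of information gets linked to each run, here is a minimal standard-library sketch. It is not Sumatra's API and does not use Sumatra at all; the function names `current_commit` and `record_run` and the `runs.json` layout are hypothetical, chosen only for illustration.

```python
import datetime
import hashlib
import json
import pathlib
import platform
import subprocess
import sys


def current_commit():
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def record_run(output_files, parameters, record_file="runs.json"):
    """Append one run record (commit, machine state, output hashes) to a JSON file."""
    record = {
        "timestamp": datetime.datetime.now().isoformat(),
        "commit": current_commit(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "parameters": parameters,  # assumed to be JSON-serializable
        "outputs": {
            str(path): hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
            for path in output_files
        },
    }
    record_path = pathlib.Path(record_file)
    records = json.loads(record_path.read_text()) if record_path.exists() else []
    records.append(record)
    record_path.write_text(json.dumps(records, indent=2))
    return record
```

Calling `record_run(["out/result.txt"], {"alpha": 0.5})` at the end of a script would add one entry per run. A real tool adds a lot on top of this, such as the web interface mentioned above.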


Sumatra looks really interesting; I'll have to give it a try.
Geoff Oxberry

It does not fulfill all of my requirements, but it is the closest to what I need. Therefore, I accepted the answer. Disclosure: I am one of the developers of Sumatra.
btel

8

Some months ago, I stumbled upon the highly recommended website of Hans-Martin v. Gaudecker, who teaches courses like "Effective programming practices for economists". In his Autumn 2010 course he introduced SCons; in his Autumn 2011 course he switched to waf, which is supposed to be faster than SCons but is still Python-based. The slides for both courses are available for download, and I (as a social scientist) found them very instructive and enlightening.


1
SCons is pretty rad. It supports very complicated or very simple schemes equally well!
meawoppl

2
The tradeoff between SCons and any faster build tool generally has to do with dependency checking. For mainstream languages (C, C++, Fortran, D, Python, Java, etc.), SCons will automatically determine dependencies using an MD5 hash-based algorithm, rather than time stamps, which can be fragile when dealing with generated files. Everything else beats SCons in performance (time needed to build software) because they don't do as much dependency checking, or they offload the dependency checking to some other tool (like the compilers used).
Geoff Oxberry
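
The difference between content-hash and timestamp checks described in the comment above can be illustrated with a small standard-library sketch. This is not how SCons is implemented; the function names and the `signatures.json` store are hypothetical, and the sketch only tracks explicitly listed sources rather than discovering dependencies automatically.

```python
import hashlib
import json
import pathlib


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()


def load_store(store):
    """Load the saved signatures, or an empty mapping if none exist yet."""
    store_path = pathlib.Path(store)
    return json.loads(store_path.read_text()) if store_path.exists() else {}


def needs_rebuild(target, sources, store="signatures.json"):
    """Return True if the target is missing or any source's content hash changed."""
    if not pathlib.Path(target).exists():
        return True
    current = {str(src): file_md5(src) for src in sources}
    return load_store(store).get(str(target)) != current


def mark_built(target, sources, store="signatures.json"):
    """Record the source hashes that the target was just built from."""
    saved = load_store(store)
    saved[str(target)] = {str(src): file_md5(src) for src in sources}
    pathlib.Path(store).write_text(json.dumps(saved, indent=2))
```

With this scheme, touching a source file without changing its contents does not trigger a rebuild, whereas a purely timestamp-based check would. The price is that every source has to be read and hashed on each check, which is the performance tradeoff the comment refers to.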

1
The first link of your answer is 404 now. It seems that his new page is at uni-bonn.de/~hmg308/teaching.html
liori

SCons has configurable "up-to-dateness" checking, so you can choose between timestamp, hash, or some combination. That said, I'm growing disenchanted with it: A few things are very easy (e.g. compiling software using a tool chain for which SCons has good Tool packages) and almost anything is possible, but it gets ugly pretty quickly.
Eric Anderson

4

Take a look at VisTrails. I haven't used it (only homebrew stuff around make), but it looks well thought-out, with good doc, and has real users at NASA etc.
(Are you looking for tools for 1-2 people, 4-5, or more?)

Added: not quite your question, but I think worth repeating:
for uniform, reproducible computer experiments one obviously needs

  • uniform directory structures, e.g. when-what/ in/ out/ scripts/ log/
  • uniform setting and echoing of all parameters for a run
  • scripts to summarize / plot / evaluate runs.
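
As a rough sketch of the first two points, a helper along these lines could create the run directory and echo the parameters. The function name `new_run_dir`, the date-based "when" prefix, and the `parameters.json` file are assumptions for illustration, not an established convention.

```python
import datetime
import json
import pathlib


def new_run_dir(what, parameters, root="."):
    """Create <when>-<what>/ with in/, out/, scripts/ and log/, and save the parameters."""
    when = datetime.date.today().isoformat()
    run_dir = pathlib.Path(root) / f"{when}-{what}"
    for sub in ("in", "out", "scripts", "log"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    # Echo every parameter to the console and keep a copy with the run.
    text = json.dumps(parameters, indent=2, sort_keys=True)
    print(text)
    (run_dir / "log" / "parameters.json").write_text(text)
    return run_dir
```

For example, `new_run_dir("heat-eq", {"n": 100, "dt": 0.01})` would create a directory named after today's date and the experiment, with the four subdirectories inside it.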

See also software-carpentry.org: "The problem we’re trying to solve is that scientists often spend 40% or more of their time wrestling with software, but 95% or more of them are primarily self-taught".


4

All the requirements you mentioned in your question are fulfilled by the Swift parallel scripting system.

I've spent a year with the Swift group as a postdoctoral researcher (PhD in scientific workflows). We've been helping scientists and researchers from different domains address their computational needs.

Swift is an open source framework for running workflows in parallel. It is called parallel scripting mainly to highlight the fact that it provides a scripting interface for creating workflows, as opposed to a GUI box-and-arrow interface.

I can personally help you get started and run your application with Swift. To learn more about Swift, please take a look here.


Welcome to scicomp! Do you mind expanding your answer a little more (click the little gray edit button below your answer to edit)? Also, can you make your connection to Swift a little more clear in your answer? Thanks!
Aron Ahmadia

1

Taverna is an open-source WMS, not Python but Java.


Have you used it?
Deathbreath

Thanks for the suggestion. I saw the Taverna website, but it looks like a mainly graphical tool. I would rather have something command-line-based. Taverna does provide a command-line tool, but only to execute workflows, not to build them (is that correct?). It also seems very much bioinformatics-oriented.
btel

It seems to me, you're more looking for a LIMS suitable for numerical experiments, rather than a build system like make or scons?
GertVdE

Sorry to ask. What does LIMS stand for exactly?
btel

1
Laboratory Information Management System. It's a family of tools for keeping a log of lab experiments. But these are typically meant for, for example, chemical analyses. You might want to Google for "in silico experiments", i.e. experiments that are simulations on a computer and require "logging" -> storing input/output data, what version of the software was used, hypotheses, ...
GertVdE


0

Dexy sounds like it is exactly what you are after. From the site:

Dexy is a multi-purpose project automation tool with lots of features designed for working with documents. Dexy is written in Python and has a command-line interface. It's open source software with an MIT license.

What does Dexy do?

Dexy makes it easier to create technical documents by doing the repetitive parts for you. Dexy provides a consistent interface to tools and scripts so you don't have to run them manually. Your project's dexy configuration keeps track of what to run, in which order, and with what parameters. This way, your whole process is captured so anyone can run it using one simple command and the results will be consistent.

You want to write a blog post with examples showing how to use an API. Dexy will automatically:

  • run your example code, saving the results
  • apply syntax highlighting to your example code (using pygments)
  • insert the results of API calls and your prettified example code into your post (using jinja)
  • convert your markdown-formatted blog post to HTML (using python markdown or pandoc).
  • upload the HTML to the WordPress API in draft mode (using the WordPress API)
  • publish your blog post when you are finished tweaking it

I've followed Dexy for a few years, and the impressions I've gotten are that it's not widely adopted, and it's not actively developed. These traits could be a chicken-and-egg problem (a small user base means it's not actively developed, and not being actively developed hurts user base growth). It looks super cool, and on its face, I think it's exactly what scientists need to broaden reproducibility beyond IPython, knitr, and bespoke scripts, but for some reason... it just doesn't seem like it gets used. Ana Nelson doesn't even blog that much about it, and she wrote it.
Geoff Oxberry

Well, the latest blog post is from January, and there have been 3 commits this year. Not super active, but not dead, especially if it's one of those projects that just gets to stable and doesn't really need any more work. There are other projects I use with much deader recent development histories. As for the chicken-and-egg problem, maybe an upvote here and wherever else it's mentioned on SE would help :P
naught101

Licensed under cc by-sa 3.0 with attribution required.