Supercomputer Utilities

The functions described here help locate tomograms and (for now) their associated flagellar motor annotations on our supercomputer's file system. get_fm_tomogram_set is particularly useful: it searches the supercomputer for the tomograms and annotations that make up the dataset for the Kaggle competition we are launching.

Each function is exposed at the top level of tomogram_datasets. For example, to import get_fm_tomogram_set, write from tomogram_datasets import get_fm_tomogram_set.

Parsing the supercomputer for tomograms with/without annotations

The supercomputer directory "/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Proteus_mirabilis" contains a number of tomograms, some of which have flagellar motor annotations, and some of which do not. The general structure is shown below.

/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Proteus_mirabilis
├── PEET_FM                           # Unwanted directory
│   ├── Run10
│   │   ├── averagedFilenames.txt
│   │   ├── FM-001.log
│   │   ├── FM-002.log
│   │   ├── FM-003.log
│   │   ├── FM-004.log
│   │   ├── FM-005.log
│   │   ├── FM-006.log
│   │   ├── FM-007.log
│       ⋮
│       ⋮
│       ⋮
├── qya2015-11-19-12                  # Directory containing annotated tomogram
│   ├── atlas3_at20002.mrc            #   - Annotated tomogram
│   ├── atlas3_at20002_part121_54.rec 
│   └── FM.mod                        #   - Annotation
├── qya2015-11-19-16                  # Directory containing unannotated tomogram
│   ├── atlas40002.mrc                #   - Unannotated tomogram
│   ├── atlas40002_part121_50.rec
│   └── qya2015-11-19-16.id           
├── qya2015-11-19-2                   # Directory containing annotated tomogram
│   ├── atlas10003.mrc                #   - Annotated tomogram
│   ├── atlas10003_part121_20.rec
│   ├── Fm.mod                        #   - Annotation
│   ├── FM.csv                        
│   └── FMinitMOTL.csv                
│   ⋮ 
│   ⋮
│   ⋮

Say we want to find the .rec and .mod files associated with tomograms with and without flagellar motors. Some subdirectories, like qya2015-11-19-12, contain a .rec tomogram (atlas3_at20002_part121_54.rec) and an annotation (FM.mod, where "FM" is an abbreviation for "flagellar motor"). Others, like qya2015-11-19-16, contain no annotation, and some, like PEET_FM above, contain only files we don't want. Every directory contains at least some files we don't want, like qya2015-11-19-12/atlas3_at20002.mrc. Here is one way to find all .rec tomograms in /grphome/grp_tomo_db1_d2/.../Proteus_mirabilis and, where applicable, automatically pair them with their respective annotations. The code uses utility functions described further down this page.

from tomogram_datasets import seek_dirs
from tomogram_datasets import seek_annotated_tomos
from tomogram_datasets import seek_unannotated_tomos

import re

root = "/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Proteus_mirabilis"
# `dir_regex` should match the target directories (directories containing a tomogram and an annotation) within `root`. Often each of these matches represents a "run".
dir_regex = re.compile(r"qya\d{4}.*") # The directories I care about all start with 'qya' and four digits
# `seek_dirs` returns a list of directories matching `dir_regex` within the provided root directory
directories = seek_dirs(root, dir_regex)

# flagellum_regex matches .mod files named `fm.mod`, ignoring case. The dot is escaped so it only matches a literal ".".
flagellum_regex = re.compile(r"^fm\.mod$", re.IGNORECASE)
# tomogram_regex matches .rec files of the form `*.rec`.
tomogram_regex = re.compile(r".*\.rec$")

# Store tomograms with flagellar motor annotations in each of the directories
# targeted by `dir_regex` using the regexes defined above.
fm_tomograms = seek_annotated_tomos(
    directories, 
    tomogram_regex, 
    [flagellum_regex], 
    ["Flagellar Motor"]
)

# Store tomograms without flagellar motor annotations in each of the directories targeted by `dir_regex` using the regexes defined above.
no_fm_tomograms = seek_unannotated_tomos(
    directories, 
    tomogram_regex, 
    [flagellum_regex]
)

After running the above code, fm_tomograms should contain a list of 15 TomogramFiles, and no_fm_tomograms should contain a list of 5 TomogramFiles.
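
To see which file names the two patterns in the walkthrough accept, the following minimal sketch runs them against names copied from the directory tree shown earlier. It uses only the standard `re` module and touches no files. The dot in the annotation pattern is escaped here so that it matches only a literal `.`.

```python
import re

# Patterns as in the walkthrough (dot escaped so it only matches a literal ".").
flagellum_regex = re.compile(r"^fm\.mod$", re.IGNORECASE)
tomogram_regex = re.compile(r".*\.rec$")

# File names copied from the directory tree above.
names = [
    "atlas3_at20002.mrc",
    "atlas3_at20002_part121_54.rec",
    "FM.mod",
    "Fm.mod",
    "FM.csv",
    "FMinitMOTL.csv",
]

annotations = [n for n in names if flagellum_regex.match(n)]
tomograms = [n for n in names if tomogram_regex.match(n)]

print(annotations)  # ['FM.mod', 'Fm.mod']
print(tomograms)    # ['atlas3_at20002_part121_54.rec']
```

Because the annotation pattern is anchored at both ends, look-alike files such as FM.csv and FMinitMOTL.csv are ignored.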

A collection of utilities for use on BYU's supercomputer.

SCTomogramSet

A class to manage the tomograms we work with on the supercomputer.

Source code in tomogram_datasets/supercomputer_utils.py
class SCTomogramSet():
    """ A class to manage the tomograms we work with on the supercomputer. """
    def __init__(self):
        self.tomograms = dict()
        self.private = dict()
    def __repr__(self):
        return f'<SCTomogramSet containing {len(self.tomograms)} tomograms>'
    def append(self, new_tomogram: TomogramFile, private: bool = True):
        """ Add a tomogram to the set. Assume it is private if `private` is not set. """
        label = _get_label(new_tomogram)
        # If the tomogram isn't present, add it
        if label not in self.tomograms:
            self.tomograms[label] = new_tomogram
            self.private[label] = private
        # Otherwise, combine its annotations with the existing tomogram's
        # annotations. 
        else:
            self.tomograms[label] = _combine_tomos(self.tomograms[label], new_tomogram)
            # If two matching tomograms have different privacy, make them both public
            if self.private[label] != private:
                self.private[label] = False

    def get_all_tomograms(self) -> List[TomogramFile]:
        """ Get all of the supercomputer tomograms. """
        return list(self.tomograms.values())
    def get_private_tomograms(self) -> List[TomogramFile]:
        """ Get all of the private (test) supercomputer tomograms. """
        requested_tomograms = []
        for label in self.tomograms:
            if self.private[label]:
                requested_tomograms.append(self.tomograms[label])
        return requested_tomograms
    def get_public_tomograms(self) -> List[TomogramFile]:
        """ Get all of the public (train) supercomputer tomograms. """
        requested_tomograms = []
        for label in self.tomograms:
            if not self.private[label]:
                requested_tomograms.append(self.tomograms[label])
        return requested_tomograms

    def get_annotated_public_tomograms(self):
        """ Get all public supercomputer tomograms that have annotations. """
        requested_tomograms = self.get_public_tomograms()
        return [tomo for tomo in requested_tomograms if tomo.has_annotation()]

    def get_unannotated_public_tomograms(self):
        """ Get all public supercomputer tomograms that have no annotations. """
        requested_tomograms = self.get_public_tomograms()
        return [tomo for tomo in requested_tomograms if not tomo.has_annotation()]

    def get_annotated_private_tomograms(self):
        """ Get all private supercomputer tomograms that have annotations. """
        requested_tomograms = self.get_private_tomograms()
        return [tomo for tomo in requested_tomograms if tomo.has_annotation()]

    def get_unannotated_private_tomograms(self):
        """ Get all private supercomputer tomograms that have no annotations. """
        requested_tomograms = self.get_private_tomograms()
        return [tomo for tomo in requested_tomograms if not tomo.has_annotation()]

append(new_tomogram, private=True)

Add a tomogram to the set. Assume it is private if private is not set.

Source code in tomogram_datasets/supercomputer_utils.py
def append(self, new_tomogram: TomogramFile, private: bool = True):
    """ Add a tomogram to the set. Assume it is private if `private` is not set. """
    label = _get_label(new_tomogram)
    # If the tomogram isn't present, add it
    if label not in self.tomograms:
        self.tomograms[label] = new_tomogram
        self.private[label] = private
    # Otherwise, combine its annotations with the existing tomogram's
    # annotations. 
    else:
        self.tomograms[label] = _combine_tomos(self.tomograms[label], new_tomogram)
        # If two matching tomograms have different privacy, make them both public
        if self.private[label] != private:
            self.private[label] = False
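
The merge rule in append can be tried out without access to the supercomputer. The sketch below is a simplified stand-in, not the library class: `MiniSet` is hypothetical, annotations are plain strings, and labels are passed in directly instead of being derived from a TomogramFile.

```python
class MiniSet:
    """Hypothetical stand-in for SCTomogramSet that stores annotation lists by label."""

    def __init__(self):
        self.annotations = {}  # label -> list of annotation names
        self.private = {}      # label -> bool

    def append(self, label, annotations, private=True):
        if label not in self.annotations:
            # First time this label is seen: store it as given.
            self.annotations[label] = list(annotations)
            self.private[label] = private
        else:
            # Duplicate label: merge the annotation lists ...
            self.annotations[label] += list(annotations)
            # ... and make the entry public if the privacy flags disagree.
            if self.private[label] != private:
                self.private[label] = False


demo = MiniSet()
demo.append("tomo_a", ["Flagellar Motor"], private=True)
demo.append("tomo_a", [], private=False)  # same label, different privacy
demo.append("tomo_b", [])                 # private by default

print(demo.private)      # {'tomo_a': False, 'tomo_b': True}
print(demo.annotations)  # {'tomo_a': ['Flagellar Motor'], 'tomo_b': []}
```

Note the asymmetry: once a label has been seen with both privacy values it stays public, regardless of the order in which the two copies were appended.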

get_all_tomograms()

Get all of the supercomputer tomograms.

Source code in tomogram_datasets/supercomputer_utils.py
def get_all_tomograms(self) -> List[TomogramFile]:
    """ Get all of the supercomputer tomograms. """
    return list(self.tomograms.values())

get_annotated_private_tomograms()

Get all private supercomputer tomograms that have annotations.

Source code in tomogram_datasets/supercomputer_utils.py
def get_annotated_private_tomograms(self):
    """ Get all private supercomputer tomograms that have annotations. """
    requested_tomograms = self.get_private_tomograms()
    return [tomo for tomo in requested_tomograms if tomo.has_annotation()]

get_annotated_public_tomograms()

Get all public supercomputer tomograms that have annotations.

Source code in tomogram_datasets/supercomputer_utils.py
def get_annotated_public_tomograms(self):
    """ Get all public supercomputer tomograms that have annotations. """
    requested_tomograms = self.get_public_tomograms()
    return [tomo for tomo in requested_tomograms if tomo.has_annotation()]

get_private_tomograms()

Get all of the private (test) supercomputer tomograms.

Source code in tomogram_datasets/supercomputer_utils.py
def get_private_tomograms(self) -> List[TomogramFile]:
    """ Get all of the private (test) supercomputer tomograms. """
    requested_tomograms = []
    for label in self.tomograms:
        if self.private[label]:
            requested_tomograms.append(self.tomograms[label])
    return requested_tomograms

get_public_tomograms()

Get all of the public (train) supercomputer tomograms.

Source code in tomogram_datasets/supercomputer_utils.py
def get_public_tomograms(self) -> List[TomogramFile]:
    """ Get all of the public (train) supercomputer tomograms. """
    requested_tomograms = []
    for label in self.tomograms:
        if not self.private[label]:
            requested_tomograms.append(self.tomograms[label])
    return requested_tomograms

get_unannotated_private_tomograms()

Get all private supercomputer tomograms that have no annotations.

Source code in tomogram_datasets/supercomputer_utils.py
def get_unannotated_private_tomograms(self):
    """ Get all private supercomputer tomograms that have no annotations. """
    requested_tomograms = self.get_private_tomograms()
    return [tomo for tomo in requested_tomograms if not tomo.has_annotation()]

get_unannotated_public_tomograms()

Get all public supercomputer tomograms that have no annotations.

Source code in tomogram_datasets/supercomputer_utils.py
def get_unannotated_public_tomograms(self):
    """ Get all public supercomputer tomograms that have no annotations. """
    requested_tomograms = self.get_public_tomograms()
    return [tomo for tomo in requested_tomograms if not tomo.has_annotation()]

_combine_tomos(tomo1, tomo2)

Combines two presumably duplicate tomograms into one by merging their annotations. The shorter of the two filepaths is kept; all other attributes are taken from tomo1, which is modified in place and returned.

Source code in tomogram_datasets/supercomputer_utils.py
def _combine_tomos(tomo1: TomogramFile, tomo2: TomogramFile) -> TomogramFile:
    """ 
    Combines two presumably duplicate tomograms into one by merging their
    annotations. The shorter of the two filepaths is kept; all other
    attributes are taken from `tomo1`, which is modified in place and returned.
    """
    # Ensure that each tomogram has a list for its annotations, even if it is
    # empty
    if tomo1.annotations is None: 
        tomo1.annotations = []
    if tomo2.annotations is None: 
        tomo2.annotations = []
    # Combine annotations
    combined_annotations = tomo1.annotations + tomo2.annotations
    new_tomo = tomo1
    new_tomo.annotations = combined_annotations
    # Choose shortest filepath
    new_tomo.filepath = min(tomo1.filepath, tomo2.filepath, key=len)
    return new_tomo
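
A minimal sketch of the merge, using `types.SimpleNamespace` objects as hypothetical stand-ins for TomogramFile. Only the `annotations` and `filepath` attributes are modeled, the paths are made up, and unlike the function above this version builds a new object instead of modifying its first argument.

```python
from types import SimpleNamespace


def combine(tomo1, tomo2):
    """Illustrative restatement of the merge logic on stand-in objects."""
    # Treat a missing annotation list as empty, then concatenate.
    merged = (tomo1.annotations or []) + (tomo2.annotations or [])
    # Keep whichever path has fewer characters.
    shortest = min(tomo1.filepath, tomo2.filepath, key=len)
    return SimpleNamespace(annotations=merged, filepath=shortest)


a = SimpleNamespace(annotations=None, filepath="/data/run_long_name/tomo.rec")
b = SimpleNamespace(annotations=["Flagellar Motor"], filepath="/data/run/tomo.rec")

merged = combine(a, b)
print(merged.annotations)  # ['Flagellar Motor']
print(merged.filepath)     # /data/run/tomo.rec
```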

_get_label(tomo)

A tomogram's "label" is its filename without the directory path or extension.

Source code in tomogram_datasets/supercomputer_utils.py
def _get_label(tomo: TomogramFile) -> str:
    """ Tomogram "labels" are the filename without path nor extension. """
    return os.path.splitext(os.path.basename(tomo.filepath))[0]
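
As a quick illustration, the same expression can be applied to plain path strings. The helper name `get_label` and the `/runs`, `/a`, and `/b` prefixes are made up for this sketch; the file name comes from the directory tree near the top of the page.

```python
import os


def get_label(filepath: str) -> str:
    """File name without its directory or final extension."""
    return os.path.splitext(os.path.basename(filepath))[0]


print(get_label("/runs/qya2015-11-19-12/atlas3_at20002_part121_54.rec"))
# atlas3_at20002_part121_54

# The same file name in two different directories, even with different
# extensions, yields the same label.
print(get_label("/a/tomo.rec") == get_label("/b/tomo.mrc"))  # True
```

This is why two files in different directories can be treated as the same tomogram by SCTomogramSet.append: only the bare file name is compared.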

get_fm_tomogram_set()

Collect all tomograms that have been reviewed for flagellar motors from BYU's supercomputer into an SCTomogramSet.

From an SCTomogramSet tomo_set, get public tomograms with tomo_set.get_public_tomograms().

Does not initially load the tomogram image data. Given a Tomogram called tomo, one can load and access the image data in one step with tomo.get_data().

Returns:
  • SCTomogramSet

    SCTomogramSet containing annotated tomograms.

Source code in tomogram_datasets/supercomputer_utils.py
def get_fm_tomogram_set() -> SCTomogramSet:
    """
    Collect all tomograms that have been reviewed for flagellar motors from
    BYU's supercomputer into an SCTomogramSet. 

    From an SCTomogramSet `tomo_set`, get public tomograms with
    `tomo_set.get_public_tomograms()`.

    Does not initially load the tomogram image data. Given a `Tomogram` called
    `tomo`, one can load and access the image data in one step with
    `tomo.get_data()`.

    Returns:
        SCTomogramSet containing annotated tomograms
    """
    # Collect all tomograms together into an SCTomogramSet.
    tomogram_set = SCTomogramSet()

    tomograms = [] # A temporary list to collect tomograms. To be placed into the tomogram_set later.

    ### PUBLIC POSITIVES ###
    print(f'\nLoading public positives.\n\tCurrent number of tomograms: {len(tomogram_set.tomograms)}\n')
    # ~~~ DRIVE 1 ~~~ #
    # Hylemonella
    root = f"/grphome/grp_tomo_db1_d1/nobackup/archive/TomoDB1_d1/FlagellarMotor_P1/Hylemonella gracilis"
    dir_regex = re.compile(r"yc\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^fm.mod$", re.IGNORECASE)
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # ~~~ DRIVE 2 ~~~ #
    # Legionella
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/legionella"
    dir_regex = re.compile(r"dg\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # Pseudomonas
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Pseudomonasaeruginosa/done"
    dir_regex = re.compile(r"ab\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # Proteus_mirabilis
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Proteus_mirabilis"
    dir_regex = re.compile(r"qya\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # ~~~ DRIVE 3 ~~~ #
    # Bdellovibrio
    root = f"/grphome/grp_tomo_db1_d3/nobackup/archive/TomoDB1_d3/jhome_extra/Bdellovibrio_YW"
    dir_regex = re.compile(r"yc\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^flagellum_SIRT_1k\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # Azospirillum
    root = f"/grphome/grp_tomo_db1_d3/nobackup/archive/TomoDB1_d3/jhome_extra/AzospirillumBrasilense/done"
    dir_regex = re.compile(r"ab\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM3\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # Add tomograms to `tomogram_set` and reset the temporary collection list `tomograms`
    for tomo in tomograms:
        tomogram_set.append(tomo, private=False)
    tomograms = []

    print(f'Loading private positives.\n\tCurrent number of tomograms: {len(tomogram_set.tomograms)}\n')
    ### PRIVATE POSITIVES ###
    # ~~~ ZHIPING ~~~ #
    root = f"/grphome/fslg_imagseg/nobackup/archive/zhiping_data/caulo_WT/"
    dir_regex = re.compile(r"rrb\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^flagellum\.mod$")
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # ~~~ ANNOTATION PARTY ~~~ #
    root = f"/grphome/grp_tomo_db1_d4/nobackup/archive/ExperimentRuns/"
    dir_regex = re.compile(r"(sma\d{4}.*)|(Vibrio.*)")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"flagellar_motor\.mod")
    tomogram_regex = re.compile(r".*\.mrc$")

    these_tomograms = seek_annotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex], 
        ["Flagellar Motor"]
    )
    tomograms += these_tomograms

    # Add tomograms to `tomogram_set` and reset the temporary collection list `tomograms`
    for tomo in tomograms:
        tomogram_set.append(tomo, private=True)
    tomograms = []

    print(f'Loading public negatives.\n\tCurrent number of tomograms: {len(tomogram_set.tomograms)}\n')
    ### PUBLIC NEGATIVES ###
    # ~~~ DRIVE 1 ~~~ #
    # Hylemonella
    root = f"/grphome/grp_tomo_db1_d1/nobackup/archive/TomoDB1_d1/FlagellarMotor_P1/Hylemonella gracilis"
    dir_regex = re.compile(r"yc\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^fm.mod$", re.IGNORECASE)
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # ~~~ DRIVE 2 ~~~ #
    # Legionella
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/legionella"
    dir_regex = re.compile(r"dg\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # Pseudomonas
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Pseudomonasaeruginosa/done"
    dir_regex = re.compile(r"ab\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # Proteus_mirabilis
    root = f"/grphome/grp_tomo_db1_d2/nobackup/archive/TomoDB1_d2/FlagellarMotor_P2/Proteus_mirabilis"
    dir_regex = re.compile(r"qya\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM\.mod$")
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # ~~~ DRIVE 3 ~~~ #
    # Bdellovibrio
    root = f"/grphome/grp_tomo_db1_d3/nobackup/archive/TomoDB1_d3/jhome_extra/Bdellovibrio_YW"
    dir_regex = re.compile(r"yc\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^flagellum_SIRT_1k\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # Azospirillum
    root = f"/grphome/grp_tomo_db1_d3/nobackup/archive/TomoDB1_d3/jhome_extra/AzospirillumBrasilense/done"
    dir_regex = re.compile(r"ab\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^FM3\.mod$")
    tomogram_regex = re.compile(r".*SIRT_1k\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # Add tomograms to `tomogram_set` and reset the temporary collection list `tomograms`
    for tomo in tomograms:
        tomogram_set.append(tomo, private=False)
    tomograms = []

    # ~~~ NEGATIVES BRAXTON FOUND ON RANDY DATA ~~~ #
    root = f"/grphome/grp_tomo_db1_d3/nobackup/autodelete/negative_data"
    these_tomograms = [TomogramFile(os.path.join(root, path), load=False) for path in os.listdir(root)]
    tomograms += these_tomograms

    print(f'Loading private negatives.\n\tCurrent number of tomograms: {len(tomogram_set.tomograms)}\n')
    ### PRIVATE NEGATIVES ###
    # ~~~ ZHIPING ~~~ #
    root = f"/grphome/fslg_imagseg/nobackup/archive/zhiping_data/caulo_WT/"
    dir_regex = re.compile(r"rrb\d{4}.*")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"^flagellum\.mod$")
    tomogram_regex = re.compile(r".*\.rec$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms

    # ~~~ ANNOTATION PARTY ~~~ #
    root = f"/grphome/grp_tomo_db1_d4/nobackup/archive/ExperimentRuns/"
    dir_regex = re.compile(r"(sma\d{4}.*)|(Vibrio.*)")
    directories = seek_dirs(root, dir_regex)

    flagellum_regex = re.compile(r"flagellar_motor\.mod")
    tomogram_regex = re.compile(r".*\.mrc$")

    these_tomograms = seek_unannotated_tomos(
        directories, 
        tomogram_regex, 
        [flagellum_regex]
    )
    tomograms += these_tomograms 

    # Add tomograms to `tomogram_set` and reset the temporary collection list `tomograms`
    for tomo in tomograms:
        tomogram_set.append(tomo, private=True)
    tomograms = []   

    print(f'Loading complete.\n\tCurrent number of tomograms: {len(tomogram_set.tomograms)}\n')

    # Return the completed set
    return tomogram_set

seek_annotated_tomos(directories, tomo_regex, annotation_regexes, annotation_names)

Collect pairs of tomogram files and their corresponding annotation files, without loading the tomograms. Expects one tomogram per directory.

Parameters:
  • directories (list of str) –

    List of directories to search for tomograms and annotations.

  • tomo_regex (Pattern) –

    The regex pattern to match tomogram filenames.

  • annotation_regexes (list of re.Pattern) –

    A list of regex patterns to match annotation filenames.

  • annotation_names (list of str) –

    A list of names for the annotations.

Returns:
  • List[TomogramFile]

    TomogramFile objects with their corresponding annotations.

Source code in tomogram_datasets/supercomputer_utils.py
def seek_annotated_tomos(
            directories: List[str], 
            tomo_regex: re.Pattern, 
            annotation_regexes: List[re.Pattern], 
            annotation_names: List[str]
        ) -> List[TomogramFile]:
    """
    Collect pairs of tomogram files and their corresponding annotation files,
    without loading the tomograms. Expects one tomogram per directory.

    Args:
        directories (list of str): List of directories to search for tomograms and annotations.

        tomo_regex (re.Pattern): The regex pattern to match tomogram filenames.

        annotation_regexes (list of re.Pattern): A list of regex patterns to match annotation filenames.

        annotation_names (list of str): A list of names for the annotations.

    Returns:
        TomogramFile objects with their corresponding annotations.
    """
    tomos = []
    for dir in directories:
        matches = seek_set(dir, [tomo_regex] + annotation_regexes)
        if matches is not None and None not in matches:
            tomogram_file = matches[0]
            annotation_files = matches[1:]
            annotations = []
            for (file, name) in zip(annotation_files, annotation_names):
                try:
                    annotations.append(AnnotationFile(file, name))
                except Exception as e:
                    print(f"An exception occurred while loading `{file}`:\n{e}\n")
            tomo = TomogramFile(tomogram_file, annotations, load=False)
            tomos.append(tomo)
    return tomos
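
The gating and pairing step in the loop above can be sketched on plain lists. `pair_annotations` is a hypothetical helper written for this illustration and the paths are invented; it mirrors how the first match is treated as the tomogram and the remaining matches are zipped with the annotation names.

```python
def pair_annotations(matches, annotation_names):
    """Return (tomogram, [(name, file), ...]) or None when anything is missing."""
    # A directory is skipped if the search failed or any slot is unfilled.
    if matches is None or None in matches:
        return None
    tomogram_file, annotation_files = matches[0], matches[1:]
    return tomogram_file, list(zip(annotation_names, annotation_files))


complete = ["run1/tomo.rec", "run1/FM.mod"]
missing = ["run2/tomo.rec", None]  # no annotation file was found

print(pair_annotations(complete, ["Flagellar Motor"]))
# ('run1/tomo.rec', [('Flagellar Motor', 'run1/FM.mod')])
print(pair_annotations(missing, ["Flagellar Motor"]))
# None
```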

seek_dirs(root, regex, directories=None)

Search for directories matching the given regex recursively within the specified root directory.

Parameters:
  • root (str) –

    The root directory to start the search.

  • regex (Pattern) –

    The regex pattern to match the directory names.

  • directories (list, default: None ) –

    A list to accumulate matched directories. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

Returns:
  • Union[List[str], None]

    A list of paths of matching directories.

Source code in tomogram_datasets/supercomputer_utils.py
def seek_dirs(
            root: str, 
            regex: re.Pattern, 
            directories: Optional[List[str]] = None
        ) -> Union[List[str], None]:
    """Search for directories matching the given regex recursively within the
    specified root directory.

    Args:
        root (str): The root directory to start the search.

        regex (re.Pattern): The regex pattern to match the directory names.

        directories (list, optional): A list to accumulate matched directories. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

    Returns:
        A list of paths of matching directories.
    """
    if directories is None:
        directories = []
    # os.walk already descends into every subdirectory, so no explicit
    # recursive call is needed.
    for this_root, dirs, _ in os.walk(root):
        for dir in dirs:
            if regex.match(dir):
                directories.append(os.path.join(this_root, dir))
    return directories
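
The following self-contained sketch shows the same kind of search on a throwaway directory tree. `find_dirs` is a hypothetical stand-in written with `os.walk` only, and the directory names are loosely modeled on the tree near the top of this page.

```python
import os
import re
import tempfile


def find_dirs(root, regex):
    """Collect every directory under root whose name matches regex."""
    found = []
    for this_root, dirs, _ in os.walk(root):
        for name in dirs:
            # regex.match anchors at the start of the directory name.
            if regex.match(name):
                found.append(os.path.join(this_root, name))
    return found


with tempfile.TemporaryDirectory() as root:
    for sub in ["PEET_FM/Run10", "qya2015-11-19-12", "nested/qya2015-11-19-16"]:
        os.makedirs(os.path.join(root, sub))

    hits = find_dirs(root, re.compile(r"qya\d{4}.*"))
    # Report paths relative to the temporary root, with "/" separators.
    relative = sorted(os.path.relpath(p, root).replace(os.sep, "/") for p in hits)

print(relative)  # ['nested/qya2015-11-19-16', 'qya2015-11-19-12']
```

Matching directories are found at any depth, while PEET_FM and Run10 are skipped because their names do not start with "qya" followed by four digits.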

seek_file(directory, regex)

Search for a file matching the given regex recursively in the specified directory.

Parameters:
  • directory (str) –

    The root directory to start the search.

  • regex (Pattern) –

    The regex pattern to match the filenames.

Returns:
  • Union[str, None]

    The full path of the matching file, or None if no match is found.

Source code in tomogram_datasets/supercomputer_utils.py
def seek_file(directory: str, regex: re.Pattern) -> Union[str, None]:
    """Search for a file matching the given regex recursively in the specified
    directory.

    Args:
        directory (str): The root directory to start the search. 
        regex (re.Pattern): The regex pattern to match the filenames.

    Returns:
        The full path of the matching file, or None if no match is found.
    """
    # os.walk already visits every subdirectory, so the first matching file
    # encountered during the walk is returned.
    for root, _, files in os.walk(directory):
        for file in files:
            if regex.match(file):
                return os.path.join(root, file)
    return None
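
A small sketch of first-match search on a temporary directory. `find_first` is a hypothetical stand-in that relies on `os.walk` alone; the file layout is invented for the example.

```python
import os
import re
import tempfile


def find_first(directory, regex):
    """Return the path of the first file whose name matches, else None."""
    for root, _, files in os.walk(directory):
        for name in files:
            if regex.match(name):
                return os.path.join(root, name)
    return None


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    open(os.path.join(top, "sub", "FM.mod"), "w").close()

    found = find_first(top, re.compile(r"^fm\.mod$", re.IGNORECASE))
    missing = find_first(top, re.compile(r".*\.rec$"))
    found_rel = os.path.relpath(found, top).replace(os.sep, "/")

print(found_rel)  # sub/FM.mod
print(missing)    # None
```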

seek_files(directory, regex, files=None)

Search for all files matching the given regex recursively in the specified directory.

Parameters:
  • directory (str) –

    The root directory to start the search.

  • regex (Pattern) –

    The regex pattern to match the filenames.

  • files (list, default: None ) –

    A list to accumulate matched files. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

Returns:
  • List[str]

    A list of the full paths of each matching file.

Source code in tomogram_datasets/supercomputer_utils.py
def seek_files(
        directory: str, 
        regex: re.Pattern, 
        files: Optional[List[str]] = None
    ) -> List[str]:
    """Search for all files matching the given regex recursively in the specified
    directory.

    Args:
        directory (str): The root directory to start the search. 

        regex (re.Pattern): The regex pattern to match the filenames.

        files (list, optional): A list to accumulate matched files. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

    Returns:
        A list of the full paths of each matching file.
    """
    if files is None:
        files = []
    # os.walk already visits every subdirectory; recursing here as well would
    # record files in nested directories more than once.
    for root, _, dir_files in os.walk(directory):
        for dir_file in dir_files:
            if regex.match(dir_file):
                files.append(os.path.join(root, dir_file))
    return files
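
To check that each matching file is reported exactly once, here is a sketch on a temporary tree with files at three depths. `find_all` is a hypothetical stand-in based on a single `os.walk` pass, and the file names are invented.

```python
import os
import re
import tempfile


def find_all(directory, regex):
    """Return the paths of all files under directory whose names match."""
    hits = []
    for root, _, files in os.walk(directory):
        hits.extend(os.path.join(root, f) for f in files if regex.match(f))
    return hits


with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "a", "b"))
    for rel in ["x.rec", "a/y.rec", "a/b/z.rec", "a/b/notes.txt"]:
        open(os.path.join(top, *rel.split("/")), "w").close()

    hits = find_all(top, re.compile(r".*\.rec$"))
    names = sorted(os.path.basename(p) for p in hits)

print(names)  # ['x.rec', 'y.rec', 'z.rec']
```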

seek_set(directory, regexes, matches=None)

Recursively search the specified directory for exactly one match for each regex in the list.

Parameters:
  • directory (str) –

    The directory to search.

  • regexes (list of re.Pattern) –

    A list of regex patterns to match filenames.

  • matches (list, default: None ) –

    A list to accumulate matches. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

Returns:
  • Union[List[str], None]

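    A list of matching file paths, one per regex and in the same order as regexes (an entry is None if that regex matched no file), or None if any regex matches more than one file.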

Source code in tomogram_datasets/supercomputer_utils.py
def seek_set(
            directory: str, 
            regexes: List[re.Pattern], 
            matches: Optional[List[str]] = None
        ) -> Union[List[str], None]:
    """Recursively search the specified directory for exactly one match for each regex in the list.

    Args:
        directory (str): The directory to search.

        regexes (list of re.Pattern): A list of regex patterns to match filenames.

        matches (list, optional): A list to accumulate matches. Should not be set in general usage, as this is used only for internal recursion. Defaults to None.

    Returns:
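        A list of matching file paths, one per regex and in the same order as `regexes` (an entry is None if that regex matched no file), or None if any regex matches more than one file.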
    """
    if matches is None:
        matches = [None for _ in regexes]

    for root, dirs, files in os.walk(directory):
        for file in files:
            for r_idx, r in enumerate(regexes):
                if re.match(r, file):
                    if matches[r_idx] is None:
                        matches[r_idx] = os.path.join(root, file)
                    else:
                        return None  # Extra match found
    return matches
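
The three possible outcomes of this function (every pattern matched once, some pattern unmatched, some pattern matched twice) are easier to see on a small example. The sketch below is self-contained and does not import tomogram_datasets: seek_set_sketch is a simplified stand-in for the logic above, make_dir is a hypothetical helper, and the directory names and regexes are assumptions chosen to resemble the tree shown earlier.

```python
import os
import re
import tempfile
from typing import List, Optional


def seek_set_sketch(directory: str,
                    regexes: List[re.Pattern]) -> Optional[List[Optional[str]]]:
    """Simplified stand-in for the listing above, not the library function."""
    matches: List[Optional[str]] = [None] * len(regexes)
    for root, _dirs, names in os.walk(directory):
        for name in names:
            for idx, regex in enumerate(regexes):
                if regex.match(name):
                    if matches[idx] is not None:
                        return None  # second hit for the same pattern
                    matches[idx] = os.path.join(root, name)
    return matches


def make_dir(parent: str, name: str, filenames: List[str]) -> str:
    """Hypothetical helper: create parent/name holding empty files."""
    path = os.path.join(parent, name)
    os.makedirs(path)
    for filename in filenames:
        open(os.path.join(path, filename), "w").close()
    return path


tomo_re = re.compile(r".*\.mrc$")
# Case-insensitive, since the tree contains both FM.mod and Fm.mod.
mod_re = re.compile(r"fm\.mod$", re.IGNORECASE)

with tempfile.TemporaryDirectory() as tmp:
    annotated = make_dir(tmp, "annotated", ["atlas3_at20002.mrc", "FM.mod"])
    bare = make_dir(tmp, "bare", ["atlas40002.mrc"])
    crowded = make_dir(tmp, "crowded", ["a.mrc", "b.mrc", "FM.mod"])

    complete = seek_set_sketch(annotated, [tomo_re, mod_re])   # two paths
    partial = seek_set_sketch(bare, [tomo_re, mod_re])         # path, then None
    ambiguous = seek_set_sketch(crowded, [tomo_re, mod_re])    # None

    complete_names = [os.path.basename(path) for path in complete]
```

Callers therefore need to distinguish a None result (ambiguous directory) from a list that contains None entries (a pattern found nothing), which is what the check "matches is not None and None not in matches" in seek_unannotated_tomos does.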

seek_unannotated_tomos(directories, tomo_regex, annotation_regexes)

Collect tomogram files that don't have annotations, without loading the tomograms.

Parameters:
  • directories (list of str) –

    List of directories to search for tomograms and annotations.

  • tomo_regex (Pattern) –

    The regex pattern to match tomogram filenames.

  • annotation_regexes (list of re.Pattern) –

    A list of regex patterns. If any of these patterns find a match for one of the files in a given directory in directories, the tomogram in that directory will not be saved and returned.

Returns:
  • List[TomogramFile]

    TomogramFile objects.

Source code in tomogram_datasets/supercomputer_utils.py
def seek_unannotated_tomos(
            directories: List[str], 
            tomo_regex: re.Pattern, 
            annotation_regexes: List[re.Pattern], 
        ) -> List[TomogramFile]:
    """
    Collect tomogram files that don't have annotations, without loading the
    tomograms.

    Args:
        directories (list of str): List of directories to search for tomograms and annotations.

        tomo_regex (re.Pattern): The regex pattern to match tomogram filenames.

        annotation_regexes (list of re.Pattern): A list of regex patterns. If any of these patterns find a match for one of the files in a given directory in `directories`, the tomogram in that directory will not be saved and returned. 

    Returns:
        TomogramFile objects.
    """
    tomos = []
    for dir in directories:
        matches = seek_set(dir, [tomo_regex] + annotation_regexes)

        if matches is not None and None not in matches:
            # This tomogram is annotated
            continue
        else:
            # Ensure that there is a tomogram in this directory
            tomo_candidates = seek_files(dir, tomo_regex)
            n_candidates = len(tomo_candidates)
            # If there are multiple possible unannotated tomogram candidates or
            # none here, that's an issue.
            if n_candidates > 1:
                warnings.warn(f"Multiple ({n_candidates}) unannotated tomograms in {dir} found. This may mean that the regular expression used to seek tomograms is not specific enough, or that this directory is strange.")
                continue
            elif n_candidates == 0:
                warnings.warn(f"No tomograms found in {dir}.")
                continue
            # If there is one candidate, it isn't annotated.
            else:
                # Append what must be the only unannotated tomogram candidate
                tomogram_file = tomo_candidates[0]
                tomo = TomogramFile(tomogram_file, load=False)
                tomos.append(tomo) 
    return tomos
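
The branching above can be summarized as a four-way decision per directory. The sketch below replays that decision on plain lists of filenames, so it touches neither the filesystem nor tomogram_datasets. classify_directory is a hypothetical helper, it assumes a single annotation pattern and a flat directory, and the regexes are assumptions; the first three listings reuse filenames from the tree shown earlier, while "two_tomos" is invented.

```python
import re
from typing import Dict, List


def classify_directory(filenames: List[str],
                       tomo_regex: re.Pattern,
                       annotation_regex: re.Pattern) -> str:
    """Hypothetical helper: label one flat file listing like the branches above."""
    tomos = [f for f in filenames if tomo_regex.match(f)]
    annotations = [f for f in filenames if annotation_regex.match(f)]
    if len(tomos) == 1 and len(annotations) == 1:
        return "annotated"    # skipped: the tomogram has an annotation
    if len(tomos) > 1:
        return "ambiguous"    # warning issued, directory skipped
    if len(tomos) == 0:
        return "empty"        # warning issued, directory skipped
    return "unannotated"      # the single tomogram would be returned


tomo_regex = re.compile(r".*\.mrc$")
annotation_regex = re.compile(r"fm\.mod$", re.IGNORECASE)

listings: Dict[str, List[str]] = {
    "qya2015-11-19-12": ["atlas3_at20002.mrc",
                         "atlas3_at20002_part121_54.rec",
                         "FM.mod"],
    "qya2015-11-19-16": ["atlas40002.mrc",
                         "atlas40002_part121_50.rec",
                         "qya2015-11-19-16.id"],
    "PEET_FM/Run10": ["averagedFilenames.txt", "FM-001.log"],
    "two_tomos": ["a.mrc", "b.mrc"],  # invented example
}

labels = {name: classify_directory(files, tomo_regex, annotation_regex)
          for name, files in listings.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Only directories labeled "unannotated" in this sketch correspond to entries in the returned list; the "ambiguous" and "empty" cases correspond to the two warnings.warn calls.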