Flexible circuit mechanisms for context-dependent song sequencing


Fly strains and rearing

See Supplementary Tables 1 and 2.

Behavioural apparatus

Behavioural experiments were performed in two custom-made circular chambers (modified from ref. 13) within black acrylic enclosures. Ambient light was provided through an LED pad inside each enclosure (3.5′′ × 6′′ white, Metaphase Technologies). For each chamber, video was recorded at 60 fps (FLIR Blackfly S Mono 1.3 MP USB3 Vision ON Semi PYTHON 1300, BFS-U3-13Y3M-C, with TechSpec 25 mm C Series VIS-NIR fixed focal length lens) using the Motif recording system and API (loopbio GmbH), run via Python 2.7, and using infrared illumination of around 22 μW mm2 (Advanced Illumination High Performance Bright Field Ring Light, 6.0′′ O.D., wash down, IR LEDs, iC2, flying leads) and an infrared bandpass filter to block the red light used for optogenetics (Thorlabs premium bandpass filter; diameter 25 mm, central wavelength = 850 nm, full width at half maximum = 10 nm). Sound was recorded at 10 kHz from 16 particle velocity microphones (Knowles NR-23158-000) tiling the floor of each chamber. Microphones were hand-painted with IR absorbing dye to limit reflection artefacts in recorded videos (Epolin Spectre 160). Temperature was monitored inside each chamber using an analogue thermosensor (Adafruit TMP36).


Flies were kept for 3–5 days on regular fly food or food supplemented with all-trans retinal (ATR) at 1 ml ATR solution (100 mM in 95% ethanol) per 100 ml of food. ATR-fed flies were reared in the dark. CsChrimson was activated at 1−205 μW mm2, using 627-nm LEDs (Luxeon Star).

Behavioural assays

For all behavioural experiments, virgin males and virgin females were used 3–5 days after eclosion. Experiments were started within 120 min of the incubator lights turning on. Males and females were single and group housed, respectively. Flies were gently loaded into the behavioural chamber before an experiment, using a custom-made aspirator. Females were placed first for paired experiments. Chamber lids were painted with Sigmacote (SL2, Sigma-Aldrich) to prevent flies from walking on the ceiling, and kept under a fume hood to dry for at least 50 min before an experiment. Videos were manually scored for copulation. Data beyond copulation were excluded from analysis, unless statistical biases required exclusion of the entire recording.

Free courtship

Free courtship recordings were performed for 30 min, as previously described3.

Optogenetic neural activation

A fixed stimulus frequency of 1/8 Hz was used for optogenetic neural activation. Stimulus irradiance could take four distinct values (0, 1, 25 and 205 μW mm−2), spanning three orders of magnitude, and stimulus duty cycle could take five distinct values (1/64, 1/32, 1/16, 1/8, and 2/8), and both irradiance and duty cycle were combined in a full factorial design, resulting in 16 distinct blocks (pooling blocks with zero irradiance) that were presented in pseudo-randomized order for 120 s each.

Offline song segmentation

For subsequent offline analysis, song was segmented as previously described11,13, using a modified sine detection parameter to account for different acoustics in the setup used here (Params.pval = 1 × 10−7). For a given recording, the output of the song segmentation algorithm included information about the start and end of each bout and each sine train, as well as the centre of each detected pulse, and a snippet of noise not including song. To reduce the risk of contaminating bout statistics with artificially split bouts due to low amplitude of sine song (the softer song mode), we excluded all bouts containing sine song with amplitude below a chosen signal-to-noise (SNR) threshold. Specifically, we estimated the noise amplitude using the noise segment that is automatically detected and returned by the song segmentation software (thus not containing song), by first reducing the 16-dimensional (for 16 microphones) noise segment to a one-dimensional vector by storing the noise value of the loudest microphone at each time point, and then defining noise amplitude as the 99th percentile of the absolute value of the one-dimensional noise vector. Sine amplitude was calculated similarly, such that the SNR for a given sine bout was the ratio of the sine amplitude and the noise amplitude. We excluded bouts containing sine song with an SNR below 1.3 from further analysis. Furthermore, the song segmenter occasionally split individual sine trains, due to intermittent noise. Uncorrected, this could, for example, split a ‘psp’ bout into one ‘ps’ and one ‘sp’ bout very close in time. This allowed us to use a simple temporal threshold to merge such bouts if the inter-bout interval was below 0.5 s. The segmentation software is freely available at https://github.com/murthylab/MurthyLab_FlySongSegmenter.


Male and female poses (locations of head, thorax, and left and right wing tip) were automatically estimated and tracked, and manually proofread for all videos using SLEAP17 (sleap.ai).

Song behaviour analysis

Song probabilities

For experiments with open-loop optogenetic neural activation, the probability for a male to sing pulse or sine song at any point in time during a trial of a given stimulus block was computed as the fraction of trials containing pulse or sine song. For analyses separating song probabilities into far and near contexts, the average mfDist within a trial was thresholded to assign the trial to one of the two contexts. Song probabilities for each context were then calculated using only those trials assigned to that context.

Song sequences

Song segmentation provided information about the start and end of each bout, and all pulse and sine events within a bout, allowing to assign each bout a label describing the sequence of contained pulse and sine trains (‘p’ for a bout containing only pulse song, ‘spspspsp’ for a bout starting with sine song followed by several alternations between pulse and sine). For statistics, we reduced the amount of different bout types by abbreviating all bouts with one or more song alternations as ‘ps…’ or ‘sp…’ and referred to these as ‘complex p’ or ‘complex s’. ‘Acute’ and ‘persistent’ bouts were defined as bouts starting during a stimulus or after stimulus offset, respectively. Rebound song was defined as song that started after stimulus offset, in a bout that started during a stimulus (for example, if the initial pulse train in a ps bout starts during a stimulus, but the following sine train starts after stimulus offset, that is considered rebound sine).

Tap detector model

The tap detector model was constructed using a convolutional neural network. The convolutional neural network consisted of two two-dimensional convolutional layers followed by two fully connected layers. The two convolutional layers had 32 output and 64 output channels, respectively, a kernel size of 5 and a stride of 1. The outputs of each convolutional layer were passed through a rectified linear unit nonlinearity and a two-dimensional max pooling layer with a kernel size of two and stride of two. The first fully connected layer had 53,824 input and 32 output features followed by a rectified linear unit nonlinearity, and the second fully connected layer had 32 input and 2 output features corresponding to scores for a tap or non-tap. The model was trained using the AdamW algorithm for 100 epochs with a batch size of 16 and a learning rate of 0.0001. The model was constructed and trained using the PyTorch library54.

To train the convolutional neural network, video frames (size 128 × 128) of courting flies centred on the male were manually labelled as a tap or non-tap event using a custom graphical user interface. Ten videos were used for creating the tap dataset, with 12,606 manual annotations total. Of these annotated frames, 70% were used for training and 30% were held out for model validation. Receiver-operating characteristic analysis was performed on held out data to determine the relationship between model recall (true-positive rate) and fallout (false-positive rate) as a function of tap detection threshold.

Tap rate analysis

Tap rate was quantified as the number of taps within a song bout, divided by the duration of the bout (to compare with time before a bout, we used the number of taps within an equally sized window preceding the bout, divided by bout duration).

Tap-based model of P1a neural activity

We convolved the binary output of the tap detection network (tap = 1/no tap = 0, using a threshold on tap probability of P(tap) ≥ 0.9) with the known calcium fluorescence of P1 neurons in response to a single tap of the female abdomen (tap-triggered average27) to get an estimate of P1a neural activity in freely courting males on a moment-to-moment basis. We deconvolved the estimated calcium fluorescence signal with a kernel of the GCaMP6s calcium response (time constant of 2.6 s)55 to obtain an estimate of P1a rate, which we used for further analysis.

Bout-triggered analysis of tap rate

For a given recording, the binary tap detector output at video resolution was first upsampled to audio resolution, using the camera trigger signal for synchronization. For each song bout with leading pulse song (simple p or complex ps…), the number of detected taps occurring during the bout, nduring, was counted, and this was divided by the duration of the bout, B, to produce tap rate during the bout, Rduring = nduring/B. As a control, the tap rate before the bout was computed as the number of taps occurring in an equally sized time window B immediately preceding the bout, Rbefore = nbefore/B. Tap rates were averaged (using the mean) per animal across simple and complex bouts and used for further analysis.

Generalized linear model analysis

To estimate the relative predictive power of different sensory features on the choice of bout (here, complex versus simple p), we used the generalized linear modelling framework with a sparse before penalize non-predictive history weights, as previously described3,56. In brief, ten sensory features (male and female forward velocity (mFV and fFV), lateral speed (mLS and fLS), rotational speed (mRS and fRS), the angle of the male (female) thorax relative to the female (male) body axis (fmAngle and mfAngle), the distance between the male and female thorax (mfDist), and the instantaneous rate of P1a neurons estimated from detected taps (P1 rate)) were first smoothed using a moving average filter with a width of 20 video frames (0.33 s). Then, 21 uniformly distributed samples were extracted from the smoothed features within the 5 s of history leading up to the end of the first pulse train of each bout with the leading pulse song (for simple pulse bouts, this corresponded to the end of the bout). Extracted features were z-scored per feature, to account for different feature dimensions and scales. Inputs to the generalized linear model (GLM) were the transformed features and a corresponding binary vector indicating whether a given feature history corresponded to a simple or complex pulse bout, and outputs were estimated filters for each feature (providing information on which dynamics in the feature, within the history window, were most predictive for bout type) and the relative deviance reduction (a measure of model performance). To estimate fit robustness, we repeated GLM fitting 51 times, each time using 70% of the input data (sampled randomly without replacement). For each feature, the mean across fits and the mean absolute deviation from the mean across fits were calculated and used for display.

Two-photon functional imaging

We imaged the activity of Dsx+ cells in the VNC following pIP10 optogenetic activation using a custom-built two-photon laser scanning microscope57,58. Virgin male flies (5–8 days old) were mounted and dissected as previously described59, with minor differences. In brief, we positioned the fly ventral head and thorax side facing up to the underside of the dissection chamber, exposing both the ventral side of the central brain and the ventral side of the VNC. From the head, we removed the proboscis, surrounding cuticle, air sacks, tracheas, and additional fat or soft tissue. From the VNC, we removed thoracic tissue ventral to the VNC (for example, legs and cuticle), exposing the first and second segments of the VNC. Perfusion saline was continuously delivered to the meniscus between the objective and the dissection chamber throughout the experiment. We imaged Dsx+ TN1 cells (one hemisphere at a time), located in the ventral side of the second segment of the VNC. Specifically, although we used flies that express the calcium indicator GCaMP6s in all Dsx+ neurons, we only imaged the prothoracic and mesothoracic neuromeres, and the accessory mesothoracic neuropil of the VNC. Together, these regions house the Pr1–3, Pr4, Ms1–3 and TN1 cluster of neurons60, whose somas have distinct and identifiable locations. We manually segmented somas from these regions that, based on their anatomical location, were unambiguously identified as TN1 neurons. TN1 can be distinguished from dPR1 (which belongs to the Pr1–3 cluster) based on the position of the somas in the anteroposterior axis. Similarly, TN1 can be readily distinguished from its neighbouring clusters (Pr4 and Ms1–3) based on its more lateral and ventral location relative to the accessory mesothoracic neuropil, as well as the smaller size of its somas. Our manual segmentation was based on these criteria rather than on neural responses. We recorded 3–4 subvolumes of approximately 70 × 70 × 20 µm3 at a speed of 1 Hz (0.3 × 0.3 × 2 µm3 to 0.4 × 0.4 × 2 µm3 voxel size), covering the full ventral-to-dorsal extent of the TN1 cluster (~70 µm). Volumetric data were collected using ScanImage 2017 and processed using FlyCaIMan58 ( via Matlab 2018b. In brief, volumetric time series of the GCaMP6s signal was motion corrected in the xyz axes using the NoRMCorre algorithm61, and temporally resampled to correct for different slice timing across planes of the same volume and to align timestamps of volumes relative to the start of the optogenetic stimulation (linear interpolation). Subvolumes consecutively recorded along the z axis were stitched along the z axis using NoRMCorre. Dsx+ TN1 somas were segmented by using the constrained non-negative matrix factorization algorithm to obtain temporal traces and spatial footprints of each soma as implemented in CaImAn58,62 (the initial number and xyz location of all TN1 somas were manually pre-defined). For pIP10 activation, we used an optogenetic protocol that combined long stimuli driving strong pulse and weaker rebound sine when activating pIP10 in solitary, freely behaving males (Fig. 2c,e). Specifically, we used a stimulus of 2 s ON (at 13 µW mm−2 irradiance) and 2 s OFF repeated four times to maximize the magnitude of evoked GCaMP responses. Imaging started 10 s before stimulus onset, where baseline activity was measured, and lasted 10 s after stimulus offset.

Neural circuit model of song bout statistics

Network simulations were performed using the Brian2 package63 with Python3. Individual neurons were defined as variants of the Izhikevich model64 with known spiking properties (such as rebound or tonic spiking) that matched experimental predictions. In brief, the neuronal membrane potential v was modelled via three ordinary differential equations:

$$\frac\rmdv\rmdt=0.04v^2+5v+140-u+I+\fracg_e+g_i\tau _\rmsyn$$




$$\frac\rmdg_e,i\rmdt=-\fracg_e,i\tau _e,i,$$


with the membrane recovery variable u, the timescale a and the sensitivity b to subthreshold fluctuations of the membrane potential of the recovery variable, and the input current I. ge and gi are excitatory and inhibitory conductances, and τsyn is the synaptic time constant. Whenever the membrane potential reached 30 mV, this was considered an action potential and the membrane variables were reset via

$$v=c,\qquad u=u+\rmd.$$


The full song circuit model comprised four Izhikevich neurons, termed p (pulse), s (sine), pC2 and inh. Parameters a, b, c and d were chosen to enable post-inhibitory rebound dynamics for the pulse and sine node, and tonic spiking for the pC2 and inh nodes (Supplementary Table 3). Inhibitory connections were defined mutually between pulse and sine, from inh to both pulse and sine, and from pC2 to inh. A single excitatory connection was defined from pC2 to pulse. Together, pC2 provided excitatory input to the pulse node and functional inhibition to the inh node, mimicking the direct pulse pathway from pC2 via pIP10 to the VNC, and the proposed disinhibitory pathway from pC2 via P1a (activated for short mfDist and strong input to pC2; Fig. 5a), respectively. For each spike in a presynaptic neuron, the synaptic conductance ge,i was incremented by we,i. we,i were free parameters that were fit during genetic algorithm optimization. The remaining free parameters were the amount of tonic input current into the inh node (Itonic, regulating the amount of tonic inhibition onto the core pulse–sine circuit, mimicking the male’s default, unaroused, state), and a multiplicative factor Ie that controlled the gain of the sensory input current into pC2. The sensory input current into pC2 was the mfDist during a given recording of wild-type courtship, subjected to nonlinear (NL) transformation via

$$I_\rmpC2=I_e\cdot \,\rmNL(mfDist),$$


$$\,\rmNL(mfDist)\,=\frac\alpha 1+\exp \left(-\beta \cdot (x_0-\rmmfDist)\right),$$


to facilitate strong/weak input current to pC2 at short/large distance. Numerical simulations of the network were performed using Euler integration, and spike times of each node were recorded for further analysis. Specifically, ‘song sequences’ of the model were defined based on the activity of the pulse and sine node, such that a coherent spike train of one node that was at least 300 ms separated from the next spike of the other node was considered a simple bout, whereas alternating activity of the two nodes within 300 ms was considered a complex bout. This simplifying assumption allowed us to fit the model to experimental song statistics, using genetic algorithm optimization (see below). We did not explicitly model a mechanism to control bout duration, and we expect that additional features such as recurrent excitation in the pulse and sine nodes are required to sustain pulse or sine trains. All model parameters are specified in Supplementary Table 4.

Genetic algorithm optimization

The distribution of model bout types in response to a given naturalistic stimulus was directly comparable with the actual distribution of male song bouts corresponding to the sensory stimulus, which we exploited to fit the four free parameters of the model (a scalar gain factor for the input to the pC2 node, the strength of a constant input current to the inh node, and one global weight each for all excitatory and inhibitory connections) to the experimental data. Specifically, we used genetic algorithm optimization (the geneticalgorithm package in Python, to minimize the root-mean-squared difference between the experimental and simulated bout distribution (using six bout types, ‘p’, ‘ps’, ‘psp…’, ‘s’, ‘sp’ and ‘sps…’, to provide more information to the algorithm than when using the four categories ultimately used for analysis; this led to slightly better model fits), as well as the absolute difference between the number of experimental and simulated bouts (Δnbout), via the objective function root-mean-squared difference + 0.1  Δnbout (see Supplementary Table 4 for optimization parameters and ranges). The relative scaling of the two objectives was chosen to prioritize reproducing the bout distribution over the number of bouts. All genetic algorithm parameters are specified in Supplementary Table 4. Four hundred-second pieces of song data, randomly chosen from all 20 wild-type recordings with at least 10% of song bouts produced far from the female (mfDist > 4 mm), were used as input to the genetic algorithm.

Knockout simulations

To test the relevance of different computational features of the circuit model, we compared genetic algorithm fit performance for the full model (here using 200-s song snippets, randomly chosen from all wild-type recordings) to fit performance for versions of the model with individual computational features ‘knocked out’ or replaced. Specifically, although in the full model both the p and the s nodes were rebound excitable (by choosing the appropriate values for parameters a, b, c and d (see Supplementary Table 3), rebound excitability was knocked out in the pulse (no rebound pulse), sine (no rebound sine) or both nodes (no rebound) by adjusting parameters a, b, c and d (to turn these nodes from ‘rebound spiking’ into ‘tonic spiking’; see Supplementary Table 3). Disinhibition was knocked out by removing the inhibitory synapses of the inh node onto the pulse and sine nodes. To compare fits to experimental data for the default model comprising disinhibition and a model comprising excitatory modulation of the pulse and sine nodes, we replaced the inhibitory weights onto and from the inh node with excitatory weights, forming an excitatory node (‘exc’) for which we removed the tonic input that was present for the inh node in the disinhibitory model.

Irradiance measurements

Irradiance levels reported for optogenetic neural activation in freely behaving flies were measured (using a Thorlabs PM100D power meter) at the centre of the experimental chamber, with the chamber lid in place. Two identical experimental setups were used for behavioural experiments, and irradiance levels were calibrated to have uniform voltage-to-irradiance conversion across setups.

Irradiance reported for optogenetic stimuli during two-photon calcium imaging was measured (also using a Thorlabs PM100D power meter) at approximately the level of the preparation (after the objective).


Statistical analyses were performed either in Matlab 2019a or Python 3.7. The two-sided Wilcoxon rank-sum test (Mann–Whitney U-test) for equal medians was used for statistical group comparisons unless noted otherwise. Error bars indicate mean ± mean absolute deviation from the mean unless otherwise specified. Sample sizes were not predetermined but are similar to those reported in previous publications13,34. Experimenters were not blinded to the conditions of the experiments during data collection and analysis. Experimental groups were defined based on genotype, and data acquisition was randomized with respect to different genotypes. All attempts at replication were successful. For box plots, the central mark indicates the median, the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. Whiskers extend to 1.5 times the interquartile range away from the box edges.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.


Leave a Reply

Your email address will not be published. Required fields are marked *